ABSTRACT

BoostMap is a recently proposed method for efficient approximate
nearest neighbor retrieval in arbitrary non-Euclidean spaces with
computationally expensive and possibly non-metric distance measures.
Database and query objects are embedded into a Euclidean space, in
which similarities can be rapidly measured using a weighted Manhattan
distance. The key idea is formulating embedding construction as a
machine learning task, where AdaBoost is used to combine simple, 1D
embeddings into a multidimensional embedding that preserves a large
amount of the proximity structure of the original space. This paper
demonstrates that, using the machine learning formulation of BoostMap,
we can optimize embeddings for indexing and classification, in ways
that are not possible with existing alternatives for constructive
embeddings, and without additional costs in retrieval time. First, we
show how to construct embeddings that are query-sensitive, in the
sense that they yield a different distance measure for different
queries, so as to improve nearest neighbor retrieval accuracy for each
query. Second, we show how to optimize embeddings for nearest neighbor
classification tasks, by tuning them to approximate a parameter space
distance measure, instead of the original feature-based distance
measure.

1.  INTRODUCTION 

Many important database applications require representing and indexing
data that belong to a non-Euclidean, and often non-metric space. Some
examples are proteins and DNA in biology, time series data in various
fields, and edge images in computer vision. Indexing such data can be
challenging, because the underlying distance measures can take time
superlinear in the length of the data, and also because many common
tree-based and hash-based indexing methods typically work in a
Euclidean space, or at least a so-called “coordinate-space”, where
each object is represented as a feature vector of fixed dimensions.

Euclidean embeddings (like Bourgain embeddings [17] and FastMap [10])
provide an alternative for indexing non-Euclidean spaces. Using
embeddings, we associate each object with a Euclidean vector, so that
distances between objects are related to Euclidean distances between
the mappings of those objects. Indexing can then be done in the
Euclidean space, and a refinement of the retrieved results can then be
performed in the original space. Euclidean embeddings can
significantly improve retrieval time in domains where evaluating the
distance measure in the original space is computationally expensive.

BoostMap [2] is a recently introduced method for embedding arbitrary
(metric or non-metric) spaces into Euclidean spaces. The main
difference between BoostMap and other existing methods is that
BoostMap treats embeddings as classifiers, and constructs them using
machine learning. In particular, given three objects a, b and c in the
original space X, an embedding F can be used to make an educated guess
as to whether a is closer to b or to c. The guess is simply that a is
closer to b than it is to c if F(a) is closer to F(b) than it is to
F(c). If using some embedding F we can make the right guess for all
triples, then that embedding perfectly preserves k-nearest-neighbor
structure, for any value of k. Overall, we want to construct
embeddings that make wrong guesses on as few triples as possible. The
classification error of an embedding is the fraction of triples on
which the embedding makes a wrong guess.

In this paper, we describe three extensions of BoostMap that can be
used to improve the quality of the embedding, when the application is
approximate nearest neighbor retrieval or efficient nearest neighbor
classification:

• We show how to construct query-sensitive embeddings, in which the
weighted $L_1$ distance used in the Euclidean space depends on the
query. In a high-dimensional embedding, using a query-sensitive
distance measure provides an elegant way to capture the fact that
different coordinates are important in different regions of the space.

• In cases where the ultimate goal is classification of the query
based on its k nearest neighbors, we show how to create embeddings
that are explicitly optimized for classification accuracy, as opposed
to being optimized for preserving distances or nearest neighbors.

• We describe an improved method for selecting the training set used
by BoostMap. In the original formulation, the training set was chosen
at random.

Database objects are embedded offline. Given a query object q, its
embedding F(q) is computed efficiently online, by measuring distances
between q and a small subset of database objects. In the case of
nearest-neighbor queries, the most similar matches obtained using the
embedding can be reranked using the original distance measure, to
improve accuracy, in a filter-and-refine framework [12]. Overall, the
original distance measure is applied only between the query and a
small number of database objects.

We also describe some preliminary experiments, in which we compare our
method to FastMap [10], using as a dataset the MNIST database of
handwritten digits [16], and using the chamfer distance as the
distance measure in the original space. Our original BoostMap
formulation leads to significantly more efficient retrieval than
FastMap. The three extensions introduced in this paper, i.e.
query-sensitive embeddings, optimization of classification accuracy,
and better choice of training set, lead to additional gains in
efficiency and classification accuracy. We are working on evaluating
BoostMap on more datasets, with different similarity measures, in
order to get a clearer picture of its performance vs. other existing
embedding methods.

2.  RELATED  WORK 

Various methods have been employed for similarity indexing in
multi-dimensional datasets, including hashing and tree structures
[27]. However, the performance of such methods degrades in high
dimensions. This phenomenon is one of the many aspects of the “curse
of dimensionality” problem. Another problem with tree-based methods is
that they typically rely on Euclidean or metric properties, and cannot
be applied to arbitrary non-metric spaces. Approximate nearest
neighbor methods have been proposed in [14, 22] and scale better with
the number of dimensions. However, those methods are available only
for specific sets of metrics, and they are not applicable to arbitrary
distance measures.

In domains where the distance measure is computationally expensive,
significant computational savings can be obtained by constructing a
distance-approximating embedding, which maps objects into another
space with a more efficient distance measure. A number of methods have
been proposed for embedding arbitrary metric spaces into a Euclidean
or pseudo-Euclidean space [7, 10, 13, 20, 23, 26, 28]. Some of these
methods, in particular MDS [28], Bourgain embeddings [7, 12], LLE [20]
and Isomap [23], are not applicable for online similarity retrieval,
because they still need to evaluate exact distances between the query
and most or all database objects. Online queries can be handled by
Lipschitz embeddings [12], FastMap [10], MetricMap [26] and SparseMap
[13], which can readily compute the embedding of the query, measuring
only a small number of exact distances in the process. These four
methods are the most related to our approach.

Various database systems have made use of Lipschitz embeddings
[4, 8, 9] and FastMap [15, 19], to map objects into a low-dimensional
Euclidean space that is more manageable for tasks like online
retrieval, data visualization, or classifier training. The goal of our
method is to improve embedding accuracy in such applications.

3.  PROBLEM  DEFINITION 


Let X be a set of objects, and $D_X(x_1, x_2)$ be a distance measure
between objects $x_1, x_2 \in X$. $D_X$ can be metric or non-metric. A
Euclidean embedding $F : X \to \mathbb{R}^d$ is a function that maps
objects from X into the d-dimensional Euclidean space $\mathbb{R}^d$,
where distance is measured using a measure $D_{\mathbb{R}^d}$.
$D_{\mathbb{R}^d}$ is typically an $L_p$ or weighted $L_p$ norm. Given
X and $D_X$, our goal is to construct an embedding F that, given a
query object q, can provide accurate approximate similarity rankings
of database objects, i.e. rankings of database objects in order of
decreasing similarity (increasing distance) to the query.

3.1  Variations  of  the  Problem 

Depending on the domain and application, there are different
variations of the general goal, which is to provide accurate
approximate similarity rankings. In this paper we will explicitly
address three different versions of this goal:

• Version 1: We want to rank all database objects in approximate (but
as accurate as possible) order of similarity to the query object. In
this variant, we care not only about identifying the nearest neighbors
of the query, but also the farthest neighbors, and in general we want
to get an approximate rank for each database object.

• Version 2: We want to approximately (but as accurately as possible)
identify the k nearest neighbors of the query object, where the value
of k is much smaller than the size of the database.

• Version 3: We want to classify the query object using k-nearest
neighbor classification, and we want to construct an embedding and a
weighted $L_1$ distance that are optimized for classification
accuracy.

BoostMap can be used to address all three versions. In the BoostMap
framework, every embedding F defines a classifier $\tilde{F}$ which,
given triples of objects $(q, x_1, x_2)$ of X, provides an estimate of
whether q is more similar to $x_1$ or to $x_2$. The way we will
customize BoostMap to address each of the three versions is by
choosing an appropriate training set of triples of objects, and by
using an appropriate definition of what is “similar”. After we make
these choices, we use the same algorithm in all three cases.

3.2  Formal  Definitions 

In order to specify the quantity that the BoostMap algorithm tries to
optimize, we introduce in this section a quantitative measure that can
be used to evaluate how “good” an embedding is in providing
approximate similarity rankings.

Let $(q, x_1, x_2)$ be a triple of objects in X. Let D be a distance
measure on X. For the first two variations of our problem statement,
$D = D_X$, but for the third variation we will use an alternative
distance measure, that depends on class labels (Section 9). We define
the proximity order $P_X(q, x_1, x_2)$ to be a function that outputs
whether q is closer to $x_1$ or to $x_2$:

$$P_X(q, x_1, x_2) = \begin{cases} 1 & \text{if } D(q, x_1) < D(q, x_2) \\ 0 & \text{if } D(q, x_1) = D(q, x_2) \\ -1 & \text{if } D(q, x_1) > D(q, x_2) \end{cases} \tag{1}$$

If F maps space X into $\mathbb{R}^d$ (with associated distance
measure $D_{\mathbb{R}^d}$), then F can be used to define a proximity
classifier $\tilde{F}$ that estimates $P_X$ using $P_{\mathbb{R}^d}$,
i.e. the proximity order function of $\mathbb{R}^d$ with distance
$D_{\mathbb{R}^d}$:

$$\tilde{F}(q, x_1, x_2) = D_{\mathbb{R}^d}(F(q), F(x_2)) - D_{\mathbb{R}^d}(F(q), F(x_1)) \tag{2}$$

If we define sign(x) to be 1 for x > 0, 0 for x = 0, and -1 for
x < 0, then $\text{sign}(\tilde{F}(q, x_1, x_2))$ is an estimate of
$P_X(q, x_1, x_2)$.

We define the classification error $G(\tilde{F}, q, x_1, x_2)$ of
applying $\tilde{F}$ on a particular triple $(q, x_1, x_2)$ as:

$$G(\tilde{F}, q, x_1, x_2) = \frac{|P_X(q, x_1, x_2) - \text{sign}(\tilde{F}(q, x_1, x_2))|}{2} \tag{3}$$


Finally, the overall classification error $G(\tilde{F})$ is defined to
be the expected value of $G(\tilde{F}, q, x_1, x_2)$, over all triples
of objects in X:

$$G(\tilde{F}) = \frac{\sum_{(q, x_1, x_2) \in X^3} G(\tilde{F}, q, x_1, x_2)}{|X|^3} \tag{4}$$


If $G(\tilde{F}) = 0$ then we consider that F perfectly preserves the
proximity structure of X. In that case, if x is the k-th nearest
neighbor of q in X, F(x) is the k-th nearest neighbor of F(q) in F(X),
for any value of k.

Overall, the classification error $G(\tilde{F})$ is a quantitative
measure of how well F preserves the proximity structure of X, and how
closely the approximate similarity rankings obtained in F(X) will
resemble the exact similarity rankings obtained in X. Using the
definitions in this section, our problem definition is very simple: we
want to construct an embedding $F : X \to \mathbb{R}^d$ in a way that
minimizes $G(\tilde{F})$.
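
To make these definitions concrete, the following minimal Python
sketch computes the proximity order of Equation 1 and the
classification error of Equation 4 by brute force over all triples.
The function names and the calling convention are illustrative
assumptions, not part of the BoostMap implementation, and the
exhaustive loop is only practical for very small sets.

```python
from itertools import product


def sign(v):
    """Return 1, 0 or -1 according to the sign of v."""
    return (v > 0) - (v < 0)


def proximity_order(dist, q, x1, x2):
    """Equation 1: 1 if q is closer to x1, -1 if closer to x2, 0 on ties."""
    return sign(dist(q, x2) - dist(q, x1))


def classification_error(objects, dist, embed, embedded_dist):
    """Equation 4: average of the per-triple error (Equation 3) over X^3."""
    total = 0.0
    for q, x1, x2 in product(objects, repeat=3):
        truth = proximity_order(dist, q, x1, x2)
        # The sign of the classifier of Equation 2 is the proximity order
        # measured in the embedded space.
        guess = proximity_order(embedded_dist, embed(q), embed(x1), embed(x2))
        total += abs(truth - guess) / 2.0
    return total / len(objects) ** 3
```

With an embedding that leaves every object unchanged and the same
distance on both sides, the error is zero, which matches the statement
that an error of zero means the proximity structure is perfectly
preserved.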

We will address this problem as a problem of combining classifiers. In
Sec. 4 we will identify a family of simple, 1D embeddings. Each such
embedding F' is expected to preserve at least a small amount of the
proximity structure of X, meaning that $G(\tilde{F}')$ is expected to
be less than 0.5, which would be the error rate of a random
classifier. Then, in Sec. 7 we will apply AdaBoost to combine many 1D
embeddings into a high-dimensional embedding F with low error rate
$G(\tilde{F})$.


4.  BACKGROUND  ON  EMBEDDINGS 

In this section we describe some existing methods for constructing
Euclidean embeddings. We briefly go over Lipschitz embeddings [12],
Bourgain embeddings [7, 12], FastMap [10] and MetricMap [26]. All
these methods, with the exception of Bourgain embeddings, can be used
for efficient approximate nearest neighbor retrieval. Although
Bourgain embeddings require too many distance computations in the
original space X in order to embed the query, there is a heuristic
approximation of Bourgain embeddings called SparseMap [13] that can
also be used for efficient retrieval.

4.1  Lipschitz  Embeddings 

We can extend $D_X$ to define the distance between elements of X and
subsets of X. Let $x \in X$ and $R \subset X$. Then,

$$D_X(x, R) = \min_{r \in R} D_X(x, r) \tag{5}$$

Given a subset $R \subset X$, a simple one-dimensional Euclidean
embedding $F^R$ can be defined as follows:

$$F^R(x) = D_X(x, R) \tag{6}$$


Figure 1: An embedding $F^r$ of five 2D points (shown on the left)
into the real line (shown on the right), using r as the reference
object. The target of each 2D point on the line is labeled with the
same letter as the 2D point. The classifier $\tilde{F}^r$ (Equation 2)
classifies correctly 46 out of the 60 triples we can form from these
five objects (assuming no object occurs twice in a triple). Examples
of misclassified triples are: (b, a, c), (c, b, d), (d, b, r). For
example, b is closer to a than it is to c, but $F^r(b)$ is closer to
$F^r(c)$ than it is to $F^r(a)$.


The set R that is used to define $F^R$ is called a reference set. In
many cases R can consist of a single object r, which is typically
called a reference object or a vantage object [12]. In that case, we
denote the embedding as $F^r$.

If $D_X$ obeys the triangle inequality, $F^R$ intuitively maps nearby
points in X to nearby points on the real line R. In many cases $D_X$
may violate the triangle inequality for some triples of objects (an
example is the chamfer distance [5]), but $F^r$ may still map nearby
points in X to nearby points in R, at least most of the time [4]. On
the other hand, distant objects may also map to nearby points
(Figure 1).
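
As a small illustration of Equations 5 and 6, the Python sketch below
builds a 1D embedding from a reference set and shows two distant
points that land on the same position of the line. The helper names
and the toy coordinates are assumptions made for this example only.

```python
import math


def distance_to_set(dist, x, reference_set):
    """Equation 5: distance from x to its closest element of the set."""
    return min(dist(x, r) for r in reference_set)


def lipschitz_1d(dist, reference_set):
    """Equation 6: return the 1D embedding defined by the reference set."""
    return lambda x: distance_to_set(dist, x, reference_set)


# Toy data: a and b lie on opposite sides of the reference object r.
r = (0.0, 0.0)
a, b = (1.0, 0.0), (-1.0, 0.0)
F_r = lipschitz_1d(math.dist, [r])

# a and b are 2 units apart, yet both are embedded at distance 1 from r.
print(math.dist(a, b), F_r(a), F_r(b))
```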

In order to make it less likely for distant objects to map to nearby
points, we can define a multidimensional embedding
$F : X \to \mathbb{R}^k$, by choosing k different reference sets
$R_1, \ldots, R_k$:

$$F(x) = (F^{R_1}(x), \ldots, F^{R_k}(x)) \tag{7}$$

These embeddings are called Lipschitz embeddings [7, 13, 12]. Bourgain
embeddings [7, 12] are a special type of Lipschitz embeddings. For a
finite space X containing |X| objects, we choose
$\lfloor \log |X| \rfloor^2$ reference sets. In particular, for each
$i = 1, \ldots, \lfloor \log |X| \rfloor$ we choose
$\lfloor \log |X| \rfloor$ reference sets, each with $2^i$ elements.
The elements of each set are picked randomly. Bourgain embeddings are
optimal in some sense: using a measure of embedding quality called
distortion, Bourgain embeddings achieve $O(\log |X|)$ distortion, and
there exist spaces X for which no better distortion can be achieved.
More details can be found in [12, 18].
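
The choice of reference sets described above can be sketched in Python
as follows. This is a hedged illustration, not the construction used
in the cited papers: the base-2 logarithm, the function names, and the
use of sampling without replacement are assumptions.

```python
import math
import random


def bourgain_reference_sets(objects, rng):
    """Pick floor(log2 |X|)^2 random reference sets.

    For each i = 1..floor(log2 |X|), floor(log2 |X|) sets of 2**i
    elements are drawn (base-2 logarithm is an assumption).
    """
    m = int(math.floor(math.log2(len(objects))))
    sets = []
    for i in range(1, m + 1):
        for _ in range(m):
            sets.append(rng.sample(objects, 2 ** i))
    return sets


def lipschitz_embedding(dist, reference_sets, x):
    """Equation 7: one coordinate per reference set."""
    return [min(dist(x, r) for r in ref) for ref in reference_sets]
```

For 16 objects this yields 16 reference sets of sizes 2, 4, 8 and 16,
so embedding one object touches most of the database, which is the
weakness of Bourgain embeddings for online retrieval.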

A weakness of Bourgain embeddings is that, in order to compute the
embedding of an object, we have to compute its distances $D_X$ to
almost all objects in X, and in database applications computing those
distances is exactly what we want to avoid. SparseMap [13] is a
heuristic simplification of Bourgain embeddings, in which the
embedding of an object can be computed by measuring only
$O(\log^2 n)$ distances.

Another way to speed up retrieval using a Bourgain embedding is to
define this embedding using a relatively small random subset
$X' \subset X$. That is, we choose $\lfloor \log |X'| \rfloor^2$
reference sets, which are subsets of X'. Then, to embed any object of
X we only need to compute its distances to all objects of X'. We used
this method to produce Bourgain embeddings of different dimensions in
the experiments we describe in [2]. We should note that, if we use
this method, the optimality of the embedding only holds for objects in
X', and there is no guarantee about the distortion attained for
objects of the larger set X. We should also note that, in general,
defining an embedding using a smaller set X' can in principle also be
applied to Isomap [23], LLE [20] and even MDS [28], so that it takes
less time to embed new objects.

Figure 2: Computing $F^{x_1,x_2}(x)$, as defined in Equation 8: we
construct a triangle ABC so that the sides AB, AC, BC have lengths
$D_X(x, x_1)$, $D_X(x, x_2)$ and $D_X(x_1, x_2)$ respectively. We draw
from A a line perpendicular to BC, and D is the intersection of that
line with BC. The length of the line segment BD is equal to
$F^{x_1,x_2}(x)$.

The theoretical optimality of Bourgain embeddings with respect to
distortion does not mean that Bourgain embeddings actually outperform
other methods in practice. Bourgain embeddings have a worst-case bound
on distortion, but that bound is very loose, and in actual
applications the quality of embeddings is often much better, both for
Bourgain embeddings and for embeddings produced using other methods.
In the experiments described in [2], BoostMap outperformed Bourgain
embeddings significantly.

A simple and attractive alternative to Bourgain embeddings is to
simply use Lipschitz embeddings in which all reference sets are
singleton. In that case, if we have a d-dimensional embedding, in
order to compute the embedding of a previously unseen object we only
need to compute its distance to d reference objects.

4.2  FastMap  and  MetricMap 

A family of simple, one-dimensional embeddings is proposed in [10] and
used as building blocks for FastMap. The idea is to choose two objects
$x_1, x_2 \in X$, called pivot objects, and then, given an arbitrary
$x \in X$, define the embedding $F^{x_1,x_2}$ of x to be the
projection of x onto the “line” $x_1 x_2$. As illustrated in Figure 2,
the projection can be defined by treating the distances between x,
$x_1$, and $x_2$ as specifying the sides of a triangle in
$\mathbb{R}^2$:

$$F^{x_1,x_2}(x) = \frac{D_X(x, x_1)^2 + D_X(x_1, x_2)^2 - D_X(x, x_2)^2}{2 D_X(x_1, x_2)} \tag{8}$$

If X is Euclidean, then $F^{x_1,x_2}$ will map nearby points in X to
nearby points in R. In practice, even if X is non-Euclidean,
$F^{x_1,x_2}$ often still preserves some of the proximity structure
of X.
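
Equation 8 translates directly into code. The Python sketch below is a
minimal illustration; the function name and the explicit check for
coinciding pivots are assumptions added for the example.

```python
import math


def fastmap_1d(dist, x1, x2):
    """Return the 1D embedding of Equation 8 for pivot objects x1 and x2."""
    d12 = dist(x1, x2)
    if d12 == 0:
        raise ValueError("pivot objects must be at nonzero distance")

    def embed(x):
        # Projection of x onto the "line" through x1 and x2, obtained
        # from the three pairwise distances via the law of cosines.
        return (dist(x, x1) ** 2 + d12 ** 2 - dist(x, x2) ** 2) / (2 * d12)

    return embed


project = fastmap_1d(math.dist, (0.0, 0.0), (4.0, 0.0))
```

In this Euclidean toy case the pivots themselves are mapped to 0 and
to their mutual distance, and any other point is mapped to its
coordinate along the pivot line.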

FastMap [10] uses multiple pairs of pivot objects to project a finite
set X into $\mathbb{R}^k$ using only O(kn) evaluations of $D_X$. The
first pair of pivot objects $(x_1, x_2)$ is chosen using a heuristic
that tends to pick points that are far from each other. Then, the rest
of the distances between objects in X are “updated”, so that they
correspond to projections into the “hyperplane” perpendicular to the
line $x_1 x_2$. Those projections are computed again by treating
distances between objects in X as Euclidean distances in some
Euclidean space. After distances are updated, FastMap is recursively
applied again to choose a next pair of pivot objects and apply another
round of distance updates. Although FastMap treats X as a Euclidean
space, the resulting embeddings can be useful even when X is
non-Euclidean, or even non-metric. We have seen that in our own
experiments (Section 11).

MetricMap [26] is an extension of FastMap that maps X into a
pseudo-Euclidean space. The experiments in [26] report that MetricMap
tends to do better than FastMap when X is non-Euclidean. So far we
have no conclusive experimental comparisons between MetricMap and our
method, partly because some details of the MetricMap algorithm have
not been fully specified (as pointed out in [12]), and therefore we
could not be sure how close our MetricMap implementation was to the
implementation evaluated in [26].

4.3  Embedding  Application:  Filter-and-refine  Retrieval

In  applications  where  we  are  interested  in  retrieving  the 
k  nearest  neighbors  or  k  correct  matches  for  a  query  object 
q,  a  d-dimensional  Euclidean  embedding  F  can  be  used  in  a 
filter-and-refine  framework  [12],  as  follows: 

•  Offline  preprocessing  step:  compute  and  store  vector 
F(x)  for  every  database  object  x. 

•  Filter  step:  given  a  query  object  q,  compute  F(q),  and 
find  the  database  objects  whose  vectors  are  the  p  most 
similar  vectors  to  F(q). 

•  Refine  step:  sort  those  p  candidates  by  evaluating  the 
exact  distance  Dx  between  q  and  each  candidate. 

The assumption is that distance measure $D_X$ is computationally
expensive and evaluating distances in Euclidean space is much faster.
The filter step discards most database objects by comparing Euclidean
vectors. The refine step applies $D_X$ only to the top p candidates.
This is much more efficient than brute-force retrieval, in which we
compute $D_X$ between q and the entire database.
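
The filter and refine steps can be sketched in a few lines of Python.
This is an illustrative outline under stated assumptions: the function
signature is hypothetical, the embedded database is assumed to be
precomputed offline, and an unweighted $L_1$ distance stands in for
whatever vector distance the embedding uses.

```python
import heapq
import math


def l1(u, v):
    """Unweighted L1 distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def filter_and_refine(query, database, embedded_db, embed, exact_dist, k, p):
    """Return the k best matches among the p candidates kept by the filter."""
    fq = embed(query)
    # Filter step: cheap vector comparisons against stored embeddings.
    candidates = heapq.nsmallest(
        p, range(len(database)), key=lambda i: l1(fq, embedded_db[i])
    )
    # Refine step: the exact distance is evaluated on p candidates only.
    ranked = sorted(candidates, key=lambda i: exact_dist(query, database[i]))
    return [database[i] for i in ranked[:k]]
```

Apart from the distances needed to embed the query, the exact distance
is evaluated exactly p times, which is where the savings over
brute-force retrieval come from.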

To optimize filter-and-refine retrieval, we have to choose p, and
often we also need to choose d, which is the dimensionality of the
embedding. As p increases, we are more likely to get the true k
nearest neighbors in the top p candidates found at the filter step,
but we also need to evaluate more distances $D_X$ at the refine step.
Overall, we trade accuracy for efficiency. Similarly, as d increases,
comparing Euclidean vectors becomes more expensive, but we may also
get more accurate results in the filter step, and we may be able to
decrease p. The best choice of p and d will depend on domain-specific
parameters like k, the time it takes to compute the distance $D_X$,
the time it takes to compare d-dimensional vectors, and the desired
retrieval accuracy (i.e. how often we are willing to miss some of the
true k nearest neighbors).

5.  MOTIVATION  FOR  BOOSTMAP 

Equations 6 and 8 define a family of one-dimensional embeddings. Given
a space of objects X, each object $r \in X$ can define a 1D embedding,
using Equation 6 with $R = \{r\}$. Each pair of objects can also
define a 1D embedding, using Equation 8. Therefore, given n objects of
X, the number of 1D embeddings we can construct using those objects is
$O(n^2)$.
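
The count can be checked by enumerating the candidates. In the Python
sketch below, the tuple descriptors and the function name are
illustrative assumptions. Unordered pivot pairs are used because
swapping the two pivots in Equation 8 only mirrors the embedding
(the two values sum to the distance between the pivots), so it leaves
all distances on the line unchanged.

```python
from itertools import combinations


def candidate_1d_embeddings(candidates):
    """List descriptors of the 1D embeddings definable from the candidates."""
    # One embedding per reference object (Equation 6 with a singleton set).
    refs = [("reference", r) for r in candidates]
    # One embedding per unordered pair of pivot objects (Equation 8).
    pivots = [("pivots", x1, x2) for x1, x2 in combinations(candidates, 2)]
    return refs + pivots
```

For n candidates this gives n + n(n - 1)/2 embeddings, for example 55
when n = 10.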

Intuitively, we expect such 1D embeddings to map nearby objects to
nearby points on the line, but at the same time they will frequently
map pairs of distant objects into pairs of nearby points. In order to
make it more likely for distant objects to map to distant Euclidean
points, we need to construct high-dimensional embeddings. Both
Lipschitz embeddings and FastMap are methods for constructing a
single, high-dimensional embedding, using simple 1D embeddings as a
building block.

In Lipschitz embeddings, we need to choose objects for each reference
set. Those objects can be chosen at random, or using some geometric
heuristics, like picking objects so that they are far from each other
[12], or picking reference objects so as to minimize stress or
distortion [3, 13]. In FastMap, we choose pivot pairs using heuristics
inspired from Euclidean geometry.

Compared to Lipschitz embeddings and FastMap, BoostMap has two
important differences:

• The algorithm produces an embedding explicitly optimized for
approximating rank information, in the form of approximating the
proximity order of triples. This is in contrast to FastMap and
Bourgain embeddings, where no quantity is explicitly optimized, and
Lipschitz embedding variations that minimize stress or distortion,
since optimizing those quantities is not equivalent to directly
optimizing for ranking accuracy.

• The optimization method that is used is AdaBoost. The main
advantages of AdaBoost are its efficiency, and its good generalization
properties (validated both in theory and in practice), which make
AdaBoost significantly resistant to overfitting [21]. Previous
approaches [3, 13] have used simple greedy optimization, which is not
as powerful.

In short, BoostMap optimizes what we really want to optimize, and it
uses a very powerful optimization method.

6.  OVERVIEW  OF  BOOSTMAP 

At  a  high  level,  the  main  points  in  our  formulation  are  the 
following: 

1. We start with a large family of 1D embeddings. As described in
previous sections, this large family can be obtained by defining 1D
embeddings based on reference objects and pairs of pivot objects.

2. We convert each 1D embedding into a binary classifier, using
Equation 2. These classifiers operate on triples of objects, and they
are expected to be pretty inaccurate, but still better than a random
classifier (which would have a 50% error rate).

3. We run AdaBoost to combine many classifiers into a single
classifier H, which we expect to be significantly more accurate than
the simple classifiers associated with 1D embeddings.

4. We use H to define a d-dimensional embedding $F_{out}$, and a
weighted $L_1$ distance measure $D_{\mathbb{R}^d}$. It is shown that H
is equivalent to the combination of $F_{out}$ and $D_{\mathbb{R}^d}$:
if, for three objects $q, a, b \in X$, H predicts that q is closer to
a than it is to b, then, under distance measure $D_{\mathbb{R}^d}$,
$F_{out}(q)$ is closer to $F_{out}(a)$ than it is to $F_{out}(b)$.


Given: $(o_1, y_1), \ldots, (o_t, y_t)$; $o_i \in Q$, $y_i \in \{-1, 1\}$.

Initialize $w_{i,1} = \frac{1}{t}$, for $i = 1, \ldots, t$.

For $j = 1, \ldots, J$:

1. Train weak learner using training weights $w_{i,j}$.

2. Get weak classifier $h_j : Q \to \mathbb{R}$.

3. Choose $\alpha_j \in \mathbb{R}$.

4. Set training weights $w_{i,j+1}$ for the next round as follows:

$$w_{i,j+1} = \frac{w_{i,j} \exp(-\alpha_j y_i h_j(o_i))}{z_j} \tag{9}$$

where $z_j$ is a normalization factor (chosen so that
$\sum_{i=1}^{t} w_{i,j+1} = 1$).

Output the final classifier:

$$H(o) = \text{sign}\left(\sum_{j=1}^{J} \alpha_j h_j(o)\right) \tag{10}$$

Figure 3: The AdaBoost algorithm. This description is largely copied
from [21].

The key idea is establishing a duality between embeddings and binary
classifiers. This duality allows us to convert 1D embeddings to
classifiers, combine those classifiers using AdaBoost, and convert the
combined classifier into a high-dimensional embedding.
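
The duality can be checked numerically. The Python sketch below
assumes that the combined classifier is a weighted sum of the 1D
classifiers of Equation 2 with nonnegative weights; under that
assumption it agrees with a weighted $L_1$ comparison in the
multidimensional embedding whose coordinates are the selected 1D
embeddings. All names are illustrative, not taken from the BoostMap
implementation.

```python
def weighted_l1(weights, u, v):
    """Weighted L1 distance between two embedded vectors."""
    return sum(w * abs(a - b) for w, a, b in zip(weights, u, v))


def combined_classifier(embeddings_1d, weights, q, a, b):
    """Weighted sum of the 1D classifiers of Equation 2 (before the sign)."""
    return sum(
        w * (abs(f(q) - f(b)) - abs(f(q) - f(a)))
        for f, w in zip(embeddings_1d, weights)
    )


def embed(embeddings_1d, x):
    """Multidimensional embedding: one coordinate per 1D embedding."""
    return [f(x) for f in embeddings_1d]
```

For any triple (q, a, b), `combined_classifier` equals the weighted
$L_1$ distance from the embedding of q to that of b, minus the
distance from the embedding of q to that of a. A positive value
therefore means q is embedded closer to a than to b.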

7.  CONSTRUCTING  EMBEDDINGS  VIA  ADABOOST

The  AdaBoost  algorithm  is  shown  in  Figure  3.  AdaBoost 
assumes  that  we  have  a  “weak  learner”  module,  which  we 
can  call  at  each  round  to  obtain  a  new  weak  classifier.  The 
goal  is  to  construct  a  strong  classifier  that  achieves  much 
higher  accuracy  than  the  individual  weak  classifiers. 

The  AdaBoost  algorithm  simply  determines  the  appro¬ 
priate  weight  for  each  weak  classifier,  and  then  adjusts  the 
training  weights.  The  training  weights  are  adjusted  so  that 
training  objects  that  are  misclassified  by  the  chosen  weak 
classifier  hj  get  more  weight  for  the  next  round. 

At  an  intuitive  level,  in  any  training  round,  the  high¬ 
est  training  weights  correspond  to  objects  that  have  been 
misclassified  by  many  of  the  previously  chosen  weak  clas¬ 
sifiers.  Because  of  the  training  weights,  the  weak  learner 
is  biased  towards  returning  a  classifier  that  tends  to  cor¬ 
rect  mistakes  of  previously  chosen  classifiers.  Overall,  weak 
classifiers  are  chosen  and  weighted  so  that  they  complement 
each  other.  The  ability  of  AdaBoost  to  construct  highly  ac¬ 
curate  classifiers  using  highly  inaccurate  weak  classifiers  has 
been  demonstrated  in  numerous  applications  (for  example, 
in  [24,  25]). 
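The reweighting behavior described above can be illustrated with a short Python sketch. Figure 3 and Equation 9 are not reproduced in this section, so the update rule below is an assumption: it is the usual confidence-rated AdaBoost update, in which each weight is multiplied by exp(-alpha * y * h) and the weights are then renormalized. The function name and the toy data are hypothetical.

```python
import math

def update_weights(weights, labels, outputs, alpha):
    """One reweighting step (assumed confidence-rated AdaBoost form).

    weights: current training weights, summing to 1
    labels:  class labels y_i
    outputs: outputs of the chosen weak classifier on each training object
    alpha:   weight assigned to the chosen weak classifier
    """
    unnormalized = [w * math.exp(-alpha * y * h)
                    for w, y, h in zip(weights, labels, outputs)]
    z = sum(unnormalized)  # normalization factor
    return [u / z for u in unnormalized], z

weights = [0.25, 0.25, 0.25, 0.25]
labels = [1, 1, -1, -1]
outputs = [1, 1, -1, 1]  # the last object is misclassified
new_weights, z = update_weights(weights, labels, outputs, alpha=0.5)
```

After the update, the misclassified object carries more weight than any correctly classified one, so the next weak classifier is pushed towards correcting that mistake.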
In the remainder of this section we will describe how we
use  AdaBoost  to  construct  an  embedding,  and  how  exactly 
we  implement  steps  1-4  of  the  main  loop  shown  in  Figure  3. 

7.1  Adaptation  of  AdaBoost 

The  training  algorithm  for  BoostMap  follows  the  AdaBoost 
algorithm,  as  described  in  Figure  3.  The  goal  of  BoostMap 
is  to  learn  an  embedding  from  an  arbitrary  space  X  to  d- 
dimensional  Euclidean  space  Rd.  AdaBoost  is  adapted  to 
the  problem  of  embedding  construction  as  follows: 

• Each training object $o_i$ is a triple $(q_i, a_i, b_i)$ of objects in X. Because of that, we refer to $o_i$ not as a training object, but as a training triple. The set Q from which training triples are picked can be the entire $X^3$ (the set of all triples we can form by objects from X), or a more restricted subset of $X^3$, as discussed in Section 10.

• The i-th training triple $(q_i, a_i, b_i)$ is associated with a class label $y_i$. For BoostMap, $y_i = P_X(q_i, a_i, b_i)$, i.e. $y_i$ is the proximity order of triple $(q_i, a_i, b_i)$, as defined in Equation 1.

• Each weak classifier $h_j$ corresponds to a 1D embedding F from X to R. In particular, $h_j = \tilde{F}$ for some 1D embedding F, where $\tilde{F}$ is defined in Equation 2.

Also, we pass to AdaBoost some additional arguments:

• A set $C \subset X$ of candidate objects. Elements of C will be used as reference objects and pivot objects to define 1D embeddings.

• A matrix of distances from each $c \in C$ to every other element of C and to each $q_i$, $a_i$ and $b_i$ included in one of the training triples in T.

7.2  Evaluating  Weak  Classifiers 

At training round j, given training weights $w_{i,j}$, the weak learner is called to provide us with a weak classifier $h_j$. In our implementation, the weak learner simply evaluates many possible classifiers, and many possible weights for each of those classifiers, and tries to find the best classifier-weight combination.

We will define two alternative ways to evaluate a classifier h at training round j. The first way is the training error $A_j$:

$$ A_j(h) = \sum_{i=1}^{t} w_{i,j} \, G(h, q_i, a_i, b_i) , \qquad (11) $$

where $G(h, q_i, a_i, b_i)$ is the error of h on the i-th training triple, as defined in Equation 4. Note that this training error is weighted based on $w_{i,j}$, and therefore $A_j(h)$ will vary with j, i.e. with each training round.

A second way to evaluate a classifier h is suggested in [21]. The function $Z_j(h, \alpha)$ gives a measure of how useful it would be to choose $h_j = h$ and $\alpha_j = \alpha$ at training round j:

$$ Z_j(h, \alpha) = \sum_{i=1}^{t} \left( w_{i,j} \exp(-\alpha y_i h(q_i, a_i, b_i)) \right) . \qquad (12) $$

The full details of the significance of $Z_j$ can be found in [21]. Here it suffices to say that if $Z_j(h, \alpha) < 1$ then choosing $h_j = h$ and $\alpha_j = \alpha$ is overall beneficial, and is expected to reduce the training error. Given the choice between two weighted classifiers $\alpha h$ and $\alpha' h'$, we should choose the weighted classifier that gives the lowest $Z_j$ value. Given $h_j$, we should choose $\alpha_j$ to be the $\alpha$ that minimizes $Z_j(h_j, \alpha)$.
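As a small illustration of how the Z criterion separates useful from harmful weak classifiers, the following Python sketch evaluates it on toy data. It assumes the weighted form of the criterion (a sum over training triples of the training weight times exp(-alpha * y * h)); the function name and the example outputs are hypothetical.

```python
import math

def z_value(weights, labels, outputs, alpha):
    """Z criterion for a classifier with the given outputs and a candidate weight alpha."""
    return sum(w * math.exp(-alpha * y * h)
               for w, y, h in zip(weights, labels, outputs))

weights = [0.25, 0.25, 0.25, 0.25]   # training weights, summing to 1
labels = [1, 1, -1, -1]              # class labels of four training triples

mostly_right = [2.0, 1.0, -1.5, 0.5]   # wrong sign only on the last triple
always_wrong = [-1.0, -1.0, 1.0, 1.0]  # wrong sign on every triple

z_good = z_value(weights, labels, mostly_right, alpha=0.5)
z_bad = z_value(weights, labels, always_wrong, alpha=0.5)
z_zero = z_value(weights, labels, mostly_right, alpha=0.0)
```

With alpha set to zero the classifier has no influence and the criterion equals the sum of the training weights, which is why 1 acts as the break-even value.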

Finding the optimal $\alpha$ for a given classifier h, and the $Z_j$ value attained using that $\alpha$, are very common operations in our algorithm, so we will define shorthands for them:

$$ A_{min}(h, j, l) = \operatorname{argmin}_{\alpha \in [l, \infty)} Z_j(h, \alpha) . \qquad (13) $$

$$ Z_{min}(h, j, l) = \min_{\alpha \in [l, \infty)} Z_j(h, \alpha) . \qquad (14) $$

In the above equations, j specifies the training round, and l specifies a minimum value for $\alpha$. $A_{min}(h, j, l)$ returns the $\alpha$ that minimizes $Z_j(h, \alpha)$, subject to the constraint that $\alpha \ge l$. Argument l will be used to ensure that no classifier has a negative weight. In Section 7.4 we will use classifier weights to define a weighted $L_1$ distance measure $D_{R^d}$ in $R^d$, and non-negative weights ensure that $D_{R^d}$ is a metric.
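The text does not say how the minimization over $\alpha$ is carried out. One possible approach, sketched below in Python, exploits the fact that Z is a sum of exponentials in $\alpha$ and therefore convex, so a golden-section search over a bounded interval finds the constrained minimizer. The upper bound `upper`, the tolerance and the function names are assumptions made for this sketch, not part of the original method.

```python
import math

def z_value(weights, labels, outputs, alpha):
    return sum(w * math.exp(-alpha * y * h)
               for w, y, h in zip(weights, labels, outputs))

def a_min(weights, labels, outputs, lower, upper=10.0, tol=1e-8):
    """Golden-section search for the alpha in [lower, upper] that minimizes Z."""
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    lo, hi = lower, upper
    x1 = hi - ratio * (hi - lo)
    x2 = lo + ratio * (hi - lo)
    f1 = z_value(weights, labels, outputs, x1)
    f2 = z_value(weights, labels, outputs, x2)
    while hi - lo > tol:
        if f1 < f2:
            # The minimum lies in [lo, x2]; reuse x1 as the new right probe.
            hi, x2, f2 = x2, x1, f1
            x1 = hi - ratio * (hi - lo)
            f1 = z_value(weights, labels, outputs, x1)
        else:
            # The minimum lies in [x1, hi]; reuse x2 as the new left probe.
            lo, x1, f1 = x1, x2, f2
            x2 = lo + ratio * (hi - lo)
            f2 = z_value(weights, labels, outputs, x2)
    return (lo + hi) / 2.0

def z_min(weights, labels, outputs, lower, upper=10.0):
    alpha = a_min(weights, labels, outputs, lower, upper)
    return z_value(weights, labels, outputs, alpha)

weights = [0.25, 0.25, 0.25, 0.25]
labels = [1, 1, -1, -1]
outputs = [1, 1, -1, 1]  # one of four triples misclassified

alpha_free = a_min(weights, labels, outputs, lower=0.0)
alpha_bounded = a_min(weights, labels, outputs, lower=1.0)
best_z = z_min(weights, labels, outputs, lower=0.0)
```

For outputs restricted to +1 and -1 with weighted error 0.25, the unconstrained minimizer has the closed form 0.5 ln 3, which gives a convenient check. When the lower bound exceeds that value, the search returns the bound itself, which is how the argument l keeps weights from dropping below a chosen limit.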

7.3  Training  Algorithm 

At the end of the j-th round, the algorithm has assembled an intermediate classifier $H_j = \sum_{i=1}^{j} \alpha_i h_i$. At a high level, $H_j$ is obtained from $H_{j-1}$ by performing one of the following operations:

•  Remove  one  of  the  already  chosen  weak  classifiers. 

•  Modify  the  weight  of  an  already  chosen  weak  classifier. 

•  Add  in  a  new  weak  classifier. 

First  we  check  whether  a  removal  or  a  weight  modification 
would  improve  the  strong  classifier.  If  this  fails,  we  add  in  a 
new  classifier.  Removals  and  weight  modifications  that  im¬ 
prove  the  strong  classifier  are  given  preference  over  adding  in 
a  new  classifier  because  they  do  not  increase  the  complexity 
of  the  strong  classifier. 

It is possible that some weak classifier occurs multiple times in $H_j$, i.e. that there exist $i, g < j$ such that $h_i = h_g$. Without loss of generality we assume that we also have an alternative representation of $H_j$, such that $H_j = \sum_{i=1}^{K_j} \alpha'_i h'_i$, such that if $g \ne i$ then $h'_g \ne h'_i$. $K_j$ is simply the number of unique weak classifiers occurring in $H_j$.

Our  exact  implementation  of  steps  1-4  from  Figure  3  is  as 
follows: 

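1. Let $z = \min_{c=1,\ldots,K_{j-1}} Z_j(h'_c, -\alpha'_c)$.

2. If z < 1:

• Set $g = \operatorname{argmin}_{c=1,\ldots,K_{j-1}} Z_j(h'_c, -\alpha'_c)$.

• Set $h_j = h'_g$, $\alpha_j = -\alpha'_g$.

• Go to step 11.

Comment: If z < 1, we effectively remove $h'_g$ from the strong classifier.

3. Let $z = \min_{c=1,\ldots,K_{j-1}} Z_{min}(h'_c, j, -\alpha'_c)$.

4. If z < .9999:

• Set $g = \operatorname{argmin}_{c=1,\ldots,K_{j-1}} Z_{min}(h'_c, j, -\alpha'_c)$.

• Set $h_j = h'_g$.

• Set $\alpha_j = A_{min}(h'_g, j, -\alpha'_g)$.

• Go to step 11.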




Comments: Here we modify the weight of $h'_g$, by adding $\alpha_j$ to it. The third arguments used when calling $Z_{min}$ and $A_{min}$ ensure that $\alpha_j \ge -\alpha'_g$, so that $\alpha_j + \alpha'_g$ (which will be the new weight of $h'_g$ in $H_j$) is guaranteed to be non-negative. Also, note that we check if z < .9999. In principle, if z < 1 then this weight modification is beneficial. By using .9999 as a threshold we avoid minor weight modifications with insignificant numerical impact on the accuracy of the strong classifier.

5. Choose randomly $M_1$ reference objects $r_1, \ldots, r_{M_1}$ from the set C of candidate objects. Construct a set $\mathcal{F}_{j1} = \{F^{r_i} \mid i = 1, \ldots, M_1\}$ of $M_1$ 1D embeddings using those reference objects, as described in Section 4.1.

6. Choose randomly a set $C_j = \{(x_{1,1}, x_{1,2}), \ldots, (x_{m,1}, x_{m,2})\}$ of m pairs of elements of C, and construct a set of embeddings $\mathcal{F}_{j2} = \{F^{x_1,x_2} \mid (x_1, x_2) \in C_j\}$, where $F^{x_1,x_2}$ is as defined in Equation 8.

7. Define $\mathcal{F}_j = \mathcal{F}_{j1} \cup \mathcal{F}_{j2}$. We set $\tilde{\mathcal{F}}_j = \{\tilde{F} \mid F \in \mathcal{F}_j\}$.

8. Evaluate $A_j(h)$ for each $h \in \tilde{\mathcal{F}}_j$, and define a set $\mathcal{H}_j$ that includes the $M_2$ classifiers in $\tilde{\mathcal{F}}_j$ with the smallest $A_j(h)$.

9. Set $h_j = \operatorname{argmin}_{h \in \mathcal{H}_j} Z_{min}(h, j, 0)$.

10. Set $\alpha_j = A_{min}(h_j, j, 0)$.

Comment: The third argument to $Z_{min}$ and $A_{min}$ in the last two steps is 0. This constrains $\alpha_j$ to be non-negative.

11. Set $z_j = Z_j(h_j, \alpha_j)$.

12. Set training weights $w_{i,j+1}$ for the next round using Equation 9.

In step 8, using a small $M_2$ reduces training time, because it lets us evaluate $A_{min}$ only for $M_2$ classifiers. In general, evaluating the weighted training error $A_j$ for a classifier h is faster (by a factor of five to ten in our experiments) than evaluating $A_{min}$, because in $A_{min}$ we need to search for the optimal value $\alpha$ that minimizes $Z_j(h, \alpha)$. If we do not care about speed, we should set $M_2 = M_1$ and $M_1 = |C|$.

The  algorithm  can  terminate  when  we  have  chosen  a  de¬ 
sired  number  of  classifiers,  or  when,  at  a  given  round  j,  we 
get  Zj  >  1,  meaning  that  we  have  failed  to  find  a  weak  clas¬ 
sifier  that  would  be  beneficial  to  add  to  the  strong  classifier. 
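The two-stage selection of steps 8 to 10 (shortlist by weighted training error, then pick the shortlisted classifier with the lowest Z) can be sketched as follows. Several details are assumptions made for this sketch: Equation 4 is not shown in this section, so the error function charges 1 for a wrong sign and 0.5 for a zero output; a coarse grid of alpha values stands in for the exact minimization; and the candidate names and outputs are hypothetical.

```python
import math

def weighted_error(weights, labels, outputs):
    # Assumed error per triple: 1 for a wrong sign, 0.5 for a zero output, 0 otherwise.
    err = 0.0
    for w, y, h in zip(weights, labels, outputs):
        if h == 0:
            err += 0.5 * w
        elif (h > 0) != (y > 0):
            err += w
    return err

def z_min_on_grid(weights, labels, outputs, lower, grid):
    # Stand-in for the exact minimization: best (Z, alpha) over grid values >= lower.
    return min(
        (sum(w * math.exp(-a * y * h) for w, y, h in zip(weights, labels, outputs)), a)
        for a in grid if a >= lower
    )

def choose_weak_classifier(weights, labels, candidates, m2, grid):
    # Step 8: keep the m2 candidates with the smallest weighted training error.
    shortlist = sorted(
        candidates,
        key=lambda name: weighted_error(weights, labels, candidates[name]),
    )[:m2]
    # Steps 9 and 10: among the shortlist, pick the lowest Z and its non-negative alpha.
    best = min(
        shortlist,
        key=lambda name: z_min_on_grid(weights, labels, candidates[name], 0.0, grid)[0],
    )
    z, alpha = z_min_on_grid(weights, labels, candidates[best], 0.0, grid)
    return best, alpha, z

weights = [0.25, 0.25, 0.25, 0.25]
labels = [1, 1, -1, -1]
candidates = {              # outputs of three hypothetical weak classifiers
    "F1": [1.0, 1.0, -1.0, 1.0],
    "F2": [1.0, -1.0, 1.0, -1.0],
    "F3": [2.0, 0.5, -1.0, -0.2],
}
grid = [0.1 * k for k in range(31)]
best, alpha, z = choose_weak_classifier(weights, labels, candidates, m2=2, grid=grid)
```

The cheap error computation discards "F2" before any search over alpha is run, which mirrors the reason given above for keeping $M_2$ small.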

7.4 Training Output: Embedding and Distance

The output of the training stage is a continuous-output classifier $H = \sum_{c=1}^{d} \alpha_c \tilde{F}_c$, where each $\tilde{F}_c$ is associated with a 1D embedding $F_c$. This classifier has been trained to estimate, for triples of objects (q, a, b), if q is closer to a or to b. However, our goal is to actually construct a Euclidean embedding. Here we discuss how to define such an embedding, so that the embedding is as accurate as the classifier H in estimating for any triple (q, a, b) if q is closer to a or to b.

Without loss of generality, we assume that if $c \ne j$ then $F_c \ne F_j$ (otherwise we add $\alpha_j$ to $\alpha_c$ and remove $\tilde{F}_j$ from H). Given H, we define an embedding $F_{out} : X \to R^d$ and a distance $D_{R^d} : R^d \times R^d \to R$:

$$ F_{out}(x) = (F_1(x), \ldots, F_d(x)) . \qquad (15) $$

$$ D_{R^d}((u_1, \ldots, u_d), (v_1, \ldots, v_d)) = \sum_{c=1}^{d} (\alpha_c |u_c - v_c|) . \qquad (16) $$

$D_{R^d}$ is a weighted Manhattan ($L_1$) distance measure. $D_{R^d}$ is a metric, because the training algorithm ensured that all $\alpha_c$'s are non-negative. We should note that, in the implementation used in [2], we had allowed weights to also be negative. By ensuring that $D_{R^d}$ is a metric, we can apply to the resulting embedding any additional indexing, clustering and visualization tools that are available for metric spaces.

It is important to note that, the way we defined $F_{out}$ and $D_{R^d}$, if we apply Equation 2 to obtain a classifier $\tilde{F}_{out}$ from $F_{out}$, then $\tilde{F}_{out} = H$, i.e. the output of AdaBoost. The proof is straightforward:


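Proposition 1. $\tilde{F}_{out} = H$.

Proof:

$$
\begin{aligned}
\tilde{F}_{out}(q, a, b) &= D_{R^d}(F_{out}(q), F_{out}(b)) - D_{R^d}(F_{out}(q), F_{out}(a)) \\
&= \sum_{c=1}^{d} \left( \alpha_c \|F_c(q) - F_c(b)\| - \alpha_c \|F_c(q) - F_c(a)\| \right) \\
&= \sum_{c=1}^{d} \left( \alpha_c \left( \|F_c(q) - F_c(b)\| - \|F_c(q) - F_c(a)\| \right) \right) \\
&= \sum_{c=1}^{d} \left( \alpha_c \tilde{F}_c(q, a, b) \right) = H(q, a, b) . \qquad \Box
\end{aligned}
$$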


This  equivalence  is  important,  because  it  shows  that  the 
quantity  optimized  by  the  training  algorithm  is  exactly  the 
quantity  that  we  set  out  to  optimize  in  our  problem  defini¬ 
tion,  i.e.  the  classification  error  G(Fout)  on  triples  of  objects. 
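The equivalence can also be checked numerically. The Python sketch below builds a toy three-dimensional embedding of the real line, computes the classifier induced by the embedding and the weighted L1 distance, and compares it with the weighted sum of the 1D classifiers on random triples. The reference objects, the weights and the sampling range are arbitrary assumptions chosen for illustration.

```python
import random

refs = [0.0, 4.0, 10.0]     # hypothetical reference objects defining three 1D embeddings
alphas = [0.5, 0.3, 0.2]    # hypothetical non-negative classifier weights
embeddings = [lambda x, r=r: abs(x - r) for r in refs]

def f_out(x):
    """Embedding built from the 1D embeddings, one coordinate each."""
    return [F(x) for F in embeddings]

def d_rd(u, v):
    """Weighted L1 distance with the classifier weights."""
    return sum(a * abs(uc - vc) for a, uc, vc in zip(alphas, u, v))

def f_out_tilde(q, a, b):
    """Classifier induced by the embedding and the weighted L1 distance."""
    return d_rd(f_out(q), f_out(b)) - d_rd(f_out(q), f_out(a))

def strong_classifier(q, a, b):
    """Weighted sum of the classifiers induced by each 1D embedding."""
    return sum(alpha * (abs(F(q) - F(b)) - abs(F(q) - F(a)))
               for alpha, F in zip(alphas, embeddings))

rng = random.Random(0)
max_gap = max(
    abs(f_out_tilde(q, a, b) - strong_classifier(q, a, b))
    for q, a, b in ([rng.uniform(-5.0, 15.0) for _ in range(3)] for _ in range(1000))
)
```

The largest gap over the sampled triples is at the level of floating-point rounding, as the proposition predicts.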

We should note that this equivalence between classifier H and embedding $F_{out}$ relies on the way we define $D_{R^d}$. If, for example, we had defined $D_{R^d}$ as an $L_2$ distance, or an unweighted $L_1$ distance, then the equivalence would no longer hold.

One  may  ask  the  following  question:  what  if  we  had  a 
very  accurate  classifier  H,  but  H  did  not  correspond  to  an 
embedding.  Could  we  use  H  directly?  To  answer  this  ques¬ 
tion,  we  should  keep  in  mind  that  our  final  goal  is  a  method 
for  producing  efficient  rankings  of  all  objects  in  a  database, 
in  approximate  order  of  similarity  to  a  query  object  q.  Any 
classifier  H  that  estimates  whether  q  is  closer  to  a  or  to  b 
defines  (given  a  query  object  q)  a  partial  order  of  database 
objects,  based  on  their  estimated  similarity  to  q.  However, 
it  is  mathematically  possible  to  design  classifiers  of  triples 
of  objects  that  do  not  define  a  total  order  for  every  q.  By 
proving  that  classifier  H  is  mathematically  equivalent  to  a 
Euclidean  embedding  Fout,  we  guarantee  that  H  defines  a 
total  order  of  database  objects  based  on  their  similarity  to 
query  object  q,  and  therefore  H  always  gives  well-defined 
similarity  rankings. 

7.5  Complexity 

Before we start the training algorithm, we need to compute a matrix of distances from each $c \in C$ to every other element of C and to each $q_i$, $a_i$ and $b_i$ included in one of the training triples in T. This can often be the most computationally expensive part of the algorithm, depending on the complexity of computing $D_X$. In addition, at each training round we evaluate $M_1$ classifiers. Therefore, the computational time per training round is $O(M_1 t)$, where t is the number of training triples. In contrast, FastMap [10], SparseMap [13], and MetricMap [26] do not require training at all.

Computing the d-dimensional embedding of n database objects takes time O(dn). Computing the d-dimensional embedding of a query object takes O(d) time and requires O(d) evaluations of $D_X$. Comparing the embedding of the query to the embeddings of n database objects takes time O(dn). For a fixed d, these costs are similar to those of FastMap [10], SparseMap [13], and MetricMap [26].

In the experiments, as well as in [2], we see that BoostMap often yields significantly higher-dimensional embeddings than FastMap. In that case, embedding the query
object  and  doing  comparisons  in  Euclidean  space  is  slower 
for  BoostMap.  At  the  same  time,  in  filter-and-refine  ex¬ 
periments,  BoostMap  actually  leads  to  much  faster  retrieval 
than  FastMap;  the  additional  cost  of  comparing  high-dimensional 
Euclidean  vectors  is  negligible  compared  to  the  savings  we 
get  in  the  refine  step,  using  the  superior  quality  of  BoostMap 
embeddings. 
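The trade-off described above can be made concrete with a small filter-and-refine sketch in Python. The filter step ranks database objects by the cheap weighted L1 distance between embeddings, and the refine step evaluates the expensive original distance only on the top p candidates. Section 4.3 is not reproduced here, so this is an illustration under assumptions rather than the exact procedure; the function names, the toy space and the value of p are hypothetical.

```python
def weighted_l1(u, v, alphas):
    return sum(a * abs(uc - vc) for a, uc, vc in zip(alphas, u, v))

def filter_and_refine(query, database, db_vectors, embed, alphas, exact_distance, p):
    q_vec = embed(query)
    # Filter: rank all database objects using only their precomputed embeddings.
    order = sorted(range(len(database)),
                   key=lambda i: weighted_l1(q_vec, db_vectors[i], alphas))
    candidates = [database[i] for i in order[:p]]
    # Refine: evaluate the expensive original distance on the p candidates only.
    return min(candidates, key=lambda x: exact_distance(query, x))

refs = [0.0, 10.0]                          # hypothetical reference objects
embed = lambda x: [abs(x - r) for r in refs]
database = [1.0, 3.5, 8.0, 12.0]
db_vectors = [embed(x) for x in database]   # computed once, offline

nearest = filter_and_refine(4.0, database, db_vectors, embed, [0.5, 0.5],
                            lambda x, y: abs(x - y), p=2)
```

A more accurate embedding lets a smaller p retain the true nearest neighbor, and since each refine-step evaluation uses the expensive distance, this is where the savings come from.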

8.  QUERY-SENSITIVE  EMBEDDINGS 

It  is  often  beneficial  to  generate,  using  BoostMap,  a  high¬ 
dimensional  embedding,  with  over  100  dimensions.  Such 
high-dimensional  embeddings  incur  the  additional  cost  of 
comparing  high-dimensional  Euclidean  vectors.  At  the  same 
time,  the  additional  accuracy  we  gain  in  high  dimensions  can 
be  desirable,  and  it  can  even  lead  to  faster  overall  retrieval  by 
reducing  p  in  a  filter-and-refine  implementation,  as  described 
in  Section  4.3.  This  effect  is  demonstrated  in  experiments 
with  BoostMap,  both  in  [2]  and  in  this  paper. 

However, even though producing a high-dimensional embedding can be beneficial, finding nearest neighbors in high dimensions also poses the following problems, as pointed out in [1]:

• Lack of contrasting: two high-dimensional objects are unlikely to be very similar in all the dimensions.

• Statistical sensitivity: The data is rarely uniformly distributed, and for a pair of objects there may be only relatively few coordinates that are statistically significant in comparing those objects.

• Skew magnification: Many attributes may be correlated with each other.

BoostMap produces a high-dimensional embedding that preserves more of the proximity structure of the original space, as compared to lower-dimensional embeddings. In that sense, distances measured in high dimensions are more meaningful than distances measured in low dimensions, since our goal is accurate similarity rankings with respect to distances in the original non-Euclidean space. At the same time, the three problems outlined above are still present, in the sense that we can achieve even better accuracy by addressing those problems in our formulation.

To address those problems, we extend the BoostMap algorithm so that it produces a query-sensitive distance measure. By "query-sensitive" we mean that the weights used for the weighted $L_1$ distance will not be fixed, as defined in Equation 16. Instead, they will depend on the query object. An automatically chosen query-sensitive distance measure provides a principled way to address the three problems described in [1], by putting more emphasis on coordinates that are more important for a particular query, and at the same time setting each weight while taking into account the effects of all other weights.

8.1 Learning a Query-Sensitive Classifier

Learning a query-sensitive distance measure is still done within the framework of AdaBoost, using a simplified version of the alternating decision-tree algorithm, described in [11]. As described earlier, every 1D embedding F corresponds to a classifier $\tilde{F}$ that classifies triples (q, a, b) of objects in X. The key idea in defining query-sensitive distance measures is that $\tilde{F}$ may do a good job only when q is in a specific region, and it is actually beneficial to ignore $\tilde{F}$ when q is outside that region. In order to do that, we need another classifier S(q) (which we call a splitter), that will estimate, given a query q, whether $\tilde{F}$ is useful or not.

More formally, if X is the original space, suppose we have a splitter $S : X \to \{0, 1\}$ and a 1D embedding $F : X \to R$. We define a query-sensitive classifier $Q_{S,F} : X^3 \to R$, as follows:

$$ Q_{S,F}(q, a, b) = S(q) \tilde{F}(q, a, b) . \qquad (17) $$

We say that the splitter S accepts q if S(q) = 1, and S rejects q if S(q) = 0.

We can readily define splitters using 1D embeddings. Given a 1D embedding $F : X \to R$, and a subset $V \subset R$, we can define a splitter $S_{F,V} : X \to \{0, 1\}$ as follows:

$$ S_{F,V}(q) = \begin{cases} 1 & \text{if } F(q) \in V , \\ 0 & \text{otherwise} . \end{cases} \qquad (18) $$

We will use the notation $Q_{F_1,V,F_2}$ for the query-sensitive classifier that is based on $S_{F_1,V}$ and $F_2$:

$$ Q_{F_1,V,F_2}(q, a, b) = S_{F_1,V}(q) \tilde{F}_2(q, a, b) . \qquad (19) $$

Suppose that the algorithm described in Section 7 has produced a d-dimensional embedding $F_{out} : X \to R^d$, such that $F_{out}(x) = (F_1(x), \ldots, F_d(x))$. We will introduce a second training phase, that starts after $F_{out}$ has been produced, and adds a query-sensitive component to $F_{out}$. This query-sensitive component will be a combination of query-sensitive classifiers $Q_{F_c,V,F_g}$, where $F_c$ and $F_g$ are parts of $F_{out}$, and $V \subset R$.

Let J be the number of training rounds it took to produce classifier H as described in Section 7, and let $H = \sum_{c=1}^{d} \alpha_c \tilde{F}_c$. For training round j > J (i.e. for a training round of the second training phase, that builds the query-sensitive component), we perform the following steps:

1. For each $c = 1, \ldots, d$, pick g randomly from $1, \ldots, d$, so that with 0.5 probability g = c. Define $\Gamma_c = F_g$.

2. For each $c = 1, \ldots, d$, define a set $\mathcal{V}_c$ of subsets of R, such that each $V \in \mathcal{V}_c$ is of the form R, $(-\infty, t)$, $(t, \infty)$, $(t_1, t_2)$, $(-\infty, t_1) \cup (t_2, \infty)$ (where $t, t_1, t_2$ are real numbers).

Comment: Each $V \in \mathcal{V}_c$ will be combined with $\Gamma_c$ to define a splitter. Choosing sets V can be done by looking at the set of values $\{\Gamma_c(q_i) : i = 1, \ldots, t\}$, where $q_i$ is the first object of the i-th training triple, and picking thresholds $t, t_1, t_2$ randomly from those values.

3. For each $c = 1, \ldots, d$, set:

$$ V_c = \operatorname{argmin}_{V \in \mathcal{V}_c} Z_{min}(Q_{\Gamma_c,V,F_c}, j, 0) . $$

Comment: here, given 1D embeddings $\Gamma_c$ and $F_c$, we find the range $V_c \in \mathcal{V}_c$ that leads to the best classifier $Q_{\Gamma_c,V,F_c}$ (i.e. the classifier that attains the lowest $Z_j$ value).

4. Set $g = \operatorname{argmin}_{c=1,\ldots,d} Z_{min}(Q_{\Gamma_c,V_c,F_c}, j, 0)$.

Comment: here we find the $g \in \{1, \ldots, d\}$ for which the corresponding classifier $Q_{\Gamma_g,V_g,F_g}$ is the best.

5. Set $h_j = Q_{\Gamma_g,V_g,F_g}$.

6. Set $\alpha_j = A_{min}(h_j, j, 0)$.

Comment: Note that the third argument to $A_{min}$ and $Z_{min}$ in all steps has been 0. This constrains $\alpha_j$ to be non-negative.

7. Set $z_j = Z_j(h_j, \alpha_j)$.

8. Set training weights $w_{i,j+1}$ for the next round using Equation 9.

Comment: the last three steps are identical to the last three steps of the algorithm in Section 7.3.

Note that the first round of the second training phase, which is training round J + 1 overall, uses training weights $w_{i,J+1}$, as they were set by the last training round (round J) of the first training phase (which was described in Section 7).

If H was the output of the first training phase, and if the second training phase was executed in training rounds $J + 1, \ldots, J_2$, we write the output $H_2$ of the second training phase as follows:

$$ H_2 = H + \sum_{c=J+1}^{J_2} \alpha_c h_c . \qquad (20) $$

We  expect  H2  to  be  more  accurate  than  H,  because  it 
includes  query-sensitive  classifiers,  each  of  which  is  focused 
on  a  specific  region  of  possible  queries.  In  particular,  for 
each  classifier  Fc  used  in  H,  the  second  training  phase  tries 
to  identify  subsets  of  queries  where  Fc  is  particularly  useful, 
and  increases  the  weight  of  Fc  for  those  queries. 

8.2  Defining  a  Query- Sensitive  Embedding 

In Section 7.4 we used H to define an embedding $F_{out}$, that maps objects of X into $R^d$, and a distance $D_{R^d}$, such that $F_{out}$ was equivalent to H. Now that we have constructed $H_2$, we also want to define an embedding and a distance based on $H_2$. The embedding associated with $H_2$ is still $F_{out}$, since $H_2$ only uses 1D embeddings that also occur in H. On the other hand, we cannot define a global distance measure in $R^d$ anymore that would make $F_{out}$ equivalent to $H_2$. To achieve this equivalence between $F_{out}$ and $H_2$, we define a query-sensitive distance $D_q$, that depends on the query object q. First, we define an auxiliary function $A_c(q)$, which assigns a weight to the c-th coordinate, for $c = 1, \ldots, d$:

$$ A_c(q) = \alpha_c + \sum_{g \,:\, g \in \{J+1, \ldots, J_2\} \,\wedge\, h_g = Q_{S,F_c} \,\wedge\, S(q) = 1} \alpha_g . \qquad (21) $$

In words, for coordinate c, we go through all the query-sensitive weak classifiers that were chosen at the second training phase. Each such query-sensitive classifier $h_g$ can be written as $Q_{S,F}$. We check if the splitter S accepts q, and if $F = F_c$. If those conditions are satisfied, we add the weight $\alpha_g$ to $\alpha_c$.

If $F_{out}(q) = (q_1, \ldots, q_d)$, and x is some other object in X, with $F_{out}(x) = (x_1, \ldots, x_d)$, we define query-sensitive distance $D_q$ as follows:

$$ D_q((q_1, \ldots, q_d), (x_1, \ldots, x_d)) = \sum_{c=1}^{d} (A_c(q) |q_c - x_c|) . \qquad (22) $$
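The query-specific weights and the resulting distance can be sketched in Python as follows. Because every splitter is defined on a 1D embedding that is itself a coordinate of the embedding, the splitter can be evaluated directly on the embedded query. The data layout (a list of tuples holding the splitter coordinate, an open interval, the target coordinate and a weight) and all names are assumptions for this sketch, and only single open intervals are represented, not the two-sided ranges.

```python
import math

def query_weights(q_vec, alphas, query_sensitive_terms):
    """Per-coordinate weights for one query.

    query_sensitive_terms: list of (splitter_coord, (low, high), target_coord, weight),
    one entry per query-sensitive classifier chosen in the second training phase.
    """
    weights = list(alphas)
    for splitter_coord, (low, high), target_coord, weight in query_sensitive_terms:
        # The splitter accepts q when the chosen coordinate of q falls inside the range.
        if low < q_vec[splitter_coord] < high:
            weights[target_coord] += weight
    return weights

def query_sensitive_distance(q_vec, x_vec, weights):
    return sum(w * abs(qc - xc) for w, qc, xc in zip(weights, q_vec, x_vec))

alphas = [0.5, 0.5]                    # weights from the first training phase
terms = [
    (0, (-math.inf, 3.0), 0, 0.4),     # boosts coordinate 0 when q's coordinate 0 < 3
    (1, (5.0, math.inf), 1, 0.2),      # boosts coordinate 1 when q's coordinate 1 > 5
]

q_vec = [2.0, 4.0]
weights_for_q = query_weights(q_vec, alphas, terms)   # only the first term fires
dist = query_sensitive_distance(q_vec, [5.0, 1.0], weights_for_q)
```

The weights are computed once per query and then reused for every database object, so the per-object cost is the same as for the query-insensitive weighted L1 distance.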

Now, using this distance $D_q$, with a slight modification of Equation 2, we can define $\tilde{F}_{out,2}$ in such a way that $\tilde{F}_{out,2} = H_2$:

$$ \tilde{F}_{out,2}(q, x_1, x_2) = D_q(F_{out}(q), F_{out}(x_2)) - D_q(F_{out}(q), F_{out}(x_1)) . \qquad (23) $$

The only difference from Equation 2 is that here we use the query-sensitive distance measure $D_q$, as opposed to a global distance measure $D_{R^d}$.

We omit the proof that $\tilde{F}_{out,2} = H_2$; it is straightforward and follows the same steps as the proof of Proposition 1. The fact that $\tilde{F}_{out,2}$ and $H_2$ are equal establishes that, if the query-sensitive classifier $H_2$ is more accurate than classifier H, then using the query-sensitive distance $D_q$ will lead to more accurate results. This is demonstrated in our experiments.

8.3  Complexity  of  Query-Sensitive  Embeddings 

To learn a query-sensitive embedding we have to perform additional training using AdaBoost. The actual number of classifiers Q we can define using embeddings in H can be quite large, since we can form such a classifier for each pair of embeddings occurring in H and each choice of a range V. We can keep training time manageable for two reasons:

• We observed that, in early implementations, AdaBoost tended to choose query-sensitive classifiers $Q_{F,V,F}$, i.e. classifiers that used the same 1D embedding F twice. The interpretation is that F(q) tends to contain significant information about whether $\tilde{F}$ is useful on triples of type (q, a, b). Based on that observation, in our current implementation we have AdaBoost consider all classifiers $Q_{S,F}$ in which S is defined using F, and an equal number of classifiers (chosen randomly) where S is not defined using F.

• For each pair of 1D embeddings $F_1$ and $F_2$, we use that pair to define a large number of $Q_{S,F_2}$ classifiers, by defining a splitter S based on $F_1$ and any of a large number of possible ranges V. However, the training errors of all those classifiers $Q_{S,F_2}$ are related, since the only thing that is different among those classifiers is the range V of the splitter S. If r is the number of ranges we are willing to consider, and t is the number of training triples used in AdaBoost, we can measure the training errors of all ranges in time O(t + r), as opposed to the time O(tr) it would take if we evaluated all those errors independently of each other. Based on that, at each training round, given $F_1$ and $F_2$ we can quickly evaluate a large number of ranges and choose the range that has the smallest training error.
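One way to share work across ranges, sketched below in Python, is to sort the training triples once by the splitter value of their query object and keep running totals of correctly and incorrectly classified weight. The statistics for any range of the form (-inf, tau) are then a single lookup, and an interval (t1, t2) can be obtained as a difference of two such lookups. This sketch is an assumption about how the sharing could be done, not the authors' implementation: it sorts and uses binary search, so its cost carries logarithmic factors unless the values and thresholds are presorted and swept in one pass. All names are hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate

def range_statistics(splitter_values, weights, correct, thresholds):
    """For each range (-inf, tau), return (correct weight, misclassified weight)
    summed over the training triples whose query is accepted by that range."""
    order = sorted(range(len(splitter_values)), key=splitter_values.__getitem__)
    sorted_vals = [splitter_values[i] for i in order]
    # Prefix sums over triples in increasing order of splitter value.
    good = [0.0] + list(accumulate(weights[i] if correct[i] else 0.0 for i in order))
    bad = [0.0] + list(accumulate(0.0 if correct[i] else weights[i] for i in order))
    stats = []
    for tau in thresholds:
        k = bisect_left(sorted_vals, tau)  # number of triples with value < tau
        stats.append((good[k], bad[k]))
    return stats

splitter_values = [1.0, 2.0, 3.0, 4.0]   # splitter embedding applied to each query
weights = [0.1, 0.2, 0.3, 0.4]           # current training weights
correct = [True, False, True, True]      # whether the base classifier is right on each triple
stats = range_statistics(splitter_values, weights, correct, thresholds=[2.5, 10.0])
```

In this toy case the range (-inf, 2.5) accepts a subset on which the base classifier is mostly wrong, while (-inf, 10.0) accepts everything, so the second range is clearly preferable.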
At retrieval time, given a query, using a query-sensitive distance measure incurs negligible additional cost over using a query-insensitive distance measure. The only additional computation we need is in order to compute the $A_c(q)$ values, i.e. the query-specific weights of each coordinate c. This
can  be  done  by  scanning  the  classifier  H2  once.  The  size  of 
classifier  H2  in  practice  is  comparable  to  the  dimensionality 
of  the  embedding.  The  total  cost  of  this  scanning  is  marginal 
compared  to  the  cost  of  evaluating  distances  between  the 
embedding  of  q  and  the  embedding  of  each  database  object. 

9. OPTIMIZING EMBEDDINGS FOR CLASSIFICATION

When  we  have  a  database  of  objects  in  some  non-Euclidean 
or  even  non-metric  space  X,  Euclidean  embeddings  can  be 
used  for  indexing,  in  order  to  efficiently  identify  the  nearest 
neighbors  of  a  query  in  the  original  space  X.  However,  in 
many  applications,  our  ultimate  goal  is  not  retrieving  near¬ 
est  neighbors,  but  actually  classifying  the  query,  using  the 
known  class  information  of  the  query’s  nearest  neighbors. 
This  section  describes  how  we  can  optimize  embeddings  di¬ 
rectly  for  classification  accuracy. 

9.1  Hidden  Parameter  Space 

As elsewhere in this paper, let X be an arbitrary space, in which we define a (possibly non-metric) distance $D_X$. Here we also assume that there is an additional distance defined on X, which we denote as $\Phi_X$. We will call $D_X$ the feature space distance and we will call $\Phi_X$ the hidden parameter space distance, or simply the parameter space distance.

Our  experimental  dataset,  the  MNIST  database,  provides 
an  example.  The  MNIST  database  consists  of  60,000  train¬ 
ing  images  and  10,000  test  images  of  handwritten  digits. 
Some  of  those  images  are  shown  in  Figure  4.  Each  image 
shows one of the 10 possible digits, from 0 to 9. One can define various distance measures on this space of digit images, like the non-metric chamfer distance [5], or shape context [6]. Given a query image q, we want to find the nearest neighbor (or k-nearest neighbors, for some k) of the query, and using the class labels of those neighbors we want to classify the query.

In this case, the feature space distance $D_X$ is a distance measure like the chamfer distance or shape context, that depends on object features. Given two objects, we can always observe those features, and therefore we can always evaluate $D_X$. The hidden parameters in this case are the class labels of the objects, which we cannot directly observe but we want to estimate. The hidden parameters are known for the database objects (the training images), but not for the query objects. We define the hidden parameter space distance $\Phi_X$ between two objects $x, y \in X$ to be 0 if those two objects have the same class labels (i.e. are pictures of the same digit), and 1 if the two objects have different class labels.

There are also domains where $\Phi_X$ is not a binary distance. For example, a dataset in which we evaluated the original BoostMap algorithm consisted of hand images [2]. In those images, we used the chamfer distance as the feature space distance, but the goal was to actually estimate the hand pose in the query image. Hand pose is a continuous space, defined by articulated joint angles and global 3D orientation of the hand. In this case one can define a distance $\Phi_X$ between hand poses. Since hand poses are not known for the query images, they are hidden parameters.

When our goal is to estimate the hidden parameters of the query object, the only use of feature space distance measure $D_X$ is that it lets us perform this estimation, using nearest-neighbor classification. Suppose now that we map database objects into a Euclidean space, using BoostMap for example, and we have a choice of two distance measures, $D_1$ and $D_2$, to use in the Euclidean space. For illustration purposes, let's make an extreme assumption that $D_1$ perfectly preserves distances $D_X$, and $D_2$ is a bad approximation of $D_X$, but it leads to higher classification accuracy than $D_X$ (and therefore than $D_1$). In that case, if classification is our goal, $D_2$ would be preferable over $D_1$.

9.2  Tuning  BoostMap  for  Classification 

Euclidean embeddings, like FastMap, MetricMap, Lipschitz embeddings, and BoostMap, provide a Euclidean substitute for the feature space distance $D_X$, which itself is a substitute for the hidden parameter space distance $\Phi_X$, which cannot be evaluated exactly. Therefore, the distance measure used in the Euclidean space is two levels of approximation away from $\Phi_X$, which is the measure we really want to estimate. However, since BoostMap is actually trained using machine learning, we can easily modify it so that it is directly optimized for classification, i.e. for recovering $\Phi_X$.

Optimizing BoostMap for classification accuracy is
straightforward. As described in the overview of the training
algorithm, each training triple (q_i, a_i, b_i) has a class label
y_i = P_X(q_i, a_i, b_i). Note that the definition of P_X in
Equation 1 is based on an underlying distance measure D between
objects of X. If our goal is approximating D_X, then we set
D = D_X. If our goal is nearest-neighbor classification, we
use D = Φ_X.
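As a minimal sketch of how the choice of D changes the labels, assume that P_X from Equation 1 takes the usual three-valued form: +1 when q is closer to the first object, -1 when it is closer to the second, and 0 on a tie. The function name `triple_label` and the toy objects and distances below are illustrative assumptions, not part of the original implementation.

```python
def triple_label(dist, q, a, b):
    """Label of triple (q, a, b) under distance `dist`.

    Assumed form of P_X: +1 if q is closer to a, -1 if q is closer to b,
    0 if the two distances are equal.
    """
    d_a, d_b = dist(q, a), dist(q, b)
    return int(d_a < d_b) - int(d_a > d_b)


# Toy objects of the form (feature value, class label).
def feature_dist(x, y):
    """Stand-in for the feature-space distance D_X."""
    return abs(x[0] - y[0])


def class_dist(x, y):
    """Stand-in for the parameter-space distance Phi_X: 0 iff same class."""
    return 0 if x[1] == y[1] else 1


q, a, b = (0.0, 3), (1.0, 3), (0.4, 8)
label_fe = triple_label(feature_dist, q, a, b)  # labels defined with D = D_X
label_pa = triple_label(class_dist, q, a, b)    # labels defined with D = Phi_X
```

In this toy triple q is closer to b in feature space but shares its class with a, so the two choices of D assign opposite labels to the same triple.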

If we have a training triple (q_i, a_i, b_i) where q_i is closer to
b_i using measure D_X but closer to a_i using measure Φ_X, then
defining class label y_i using Φ_X will steer the training
algorithm towards trying to produce an embedding F_out and a
distance D_q such that
D_q(F_out(q_i), F_out(a_i)) < D_q(F_out(q_i), F_out(b_i)).
Overall, the training algorithm will try to map objects from
the same class closer to each other than to objects of other
classes.

10.  CHOOSING  TRAINING  TRIPLES 

In the original implementation and experimental evaluation
of BoostMap in [2], training triples were chosen at
random. With a random training set of triples, BoostMap
tries to preserve the entire similarity structure of the
original space X. This means that the resulting embedding
is equally optimized for nearest neighbor queries, farthest
neighbor queries, or median neighbor queries. In cases where
we only care about nearest neighbor queries, we would
actually prefer an embedding that gave more accurate results
for such queries, even if such an embedding did not preserve
other aspects of the similarity structure of X, like
farthest-neighbor information.

If  we  want  to  construct  an  embedding  for  the  purpose  of 
answering  nearest  neighbor  queries,  then  we  can  construct 
training  triples  in  a  more  selective  manner.  The  main  idea 
is  that,  given  an  object  q,  we  should  form  a  triple  (q,  a,  b) 
where  both  a  and  b  are  relatively  close  to  q. 

In our experiments with the MNIST database, having in
mind that we also wanted to optimize embeddings for
classification accuracy, we chose training triples as follows: first,
we specify the desired number t of training triples to
produce, and an integer k' that specifies how "far" a_i and
b_i can be from q_i in each triple (q_i, a_i, b_i). Then, we choose
the i-th training triple (q_i, a_i, b_i) as follows:

1. Choose a random training object q_i.

2. Choose an integer k in 1, ..., k'.

3. Choose a_i to be the k-th nearest neighbor of q_i among all
training objects for which Φ_X(a_i, q_i) = 0. This way a_i
has the same class label as q_i.

4. Choose a number r in 9k, ..., 9k + 9. Note that 10 is
the number of classes in the MNIST database, and 9 is
the number of classes that are different from the class
of q_i.

5. Choose b_i to be the r-th nearest neighbor of q_i among all
training objects whose class label is different from the
class label of q_i.

Essentially, each training triple contains an object q, one
of its nearest neighbors among objects of the same class as
q, and one of the nearest neighbors among objects of all
classes different from the class of q. If we did not care about
classification, we could simply have chosen random triples
such that a and b are among the k' nearest neighbors of q.
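The five steps above can be sketched as follows. This is only an illustration under stated assumptions: distances between training objects are taken to be available as a precomputed matrix, q itself is excluded from its own same-class neighbors, the constant 9 is generalized to the number of other classes, and ranks that exceed the number of available objects are clamped. The function name and signature are hypothetical.

```python
import numpy as np


def sample_selective_triples(dist, labels, t, k_prime, rng):
    """Sample t triples (q, a, b) of training-object indices.

    dist:    (n, n) array of distances D_X between training objects
    labels:  (n,) array of class labels
    k_prime: upper bound k' on the same-class neighbor rank
    """
    labels = np.asarray(labels)
    n = len(labels)
    n_other = len(np.unique(labels)) - 1      # 9 for MNIST
    triples = np.empty((t, 3), dtype=int)
    for i in range(t):
        q = rng.integers(n)                   # step 1: random training object
        k = rng.integers(1, k_prime + 1)      # step 2: k in 1..k'
        same = np.flatnonzero((labels == labels[q]) & (np.arange(n) != q))
        other = np.flatnonzero(labels != labels[q])
        same = same[np.argsort(dist[q, same], kind="stable")]
        other = other[np.argsort(dist[q, other], kind="stable")]
        a = same[min(k, len(same)) - 1]       # step 3: k-th same-class neighbor
        # step 4: r in 9k..9k+9, written for a general number of classes
        r = rng.integers(n_other * k, n_other * k + n_other + 1)
        b = other[min(r, len(other)) - 1]     # step 5: r-th other-class neighbor
        triples[i] = (q, a, b)
    return triples
```

Sorting the neighbors of every sampled q is wasteful when t is large; caching the sorted neighbor lists per object would be the obvious refinement.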

The reason we choose r to be roughly 9 times bigger than
k is that, with that choice, it is reasonable to assume that
most of the time D_X(q_i, a_i) will be smaller than D_X(q_i, b_i).
In general, if M is the number of classes, even if D_X carries
zero information about Φ_X, we still expect that on average
the k-th nearest neighbor of q among same-class objects will
have the same rank as the k(M - 1)-th nearest neighbor of q
among objects that belong to different classes. For this
assumption to hold, we just need to have the same number of
objects in each class, and D_X to be no worse than a random
distance measure for k-nearest neighbor classification.
These are very weak assumptions. At the same time, if we
learn an embedding that, for a large percentage of q objects
and k values, maps q closer to its k-th nearest neighbor among
same-class objects than to the k(M - 1)-th nearest neighbor
among objects of different classes, then we expect that
embedding to lead to high nearest-neighbor classification
accuracy. So, setting r to roughly k(M - 1), we expect the weak
classifiers to be better than random classifiers, and the
accuracy of the strong classifier on triples is related to k-nn
classification accuracy on query objects using the embedding.
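The rank argument can be checked numerically with a small simulation. The sketch below assumes balanced classes and a distance that carries no class information, modeled as a uniformly random ranking of the database objects around q. It then compares the average overall rank of the k-th same-class object with that of the k(M - 1)-th different-class object. All names and parameter values are illustrative.

```python
import numpy as np


def mean_ranks(n_per_class, n_classes, k, trials, rng):
    """Average overall rank of the k-th same-class neighbor and of the
    k(M-1)-th different-class neighbor under a random ranking."""
    n_same = n_per_class - 1                    # q itself is excluded
    n_other = n_per_class * (n_classes - 1)
    is_same = np.zeros(n_same + n_other, dtype=bool)
    is_same[:n_same] = True
    same_ranks, other_ranks = [], []
    for _ in range(trials):
        ranking = rng.permutation(is_same)      # random order around q
        same_pos = np.flatnonzero(ranking) + 1  # 1-based overall ranks
        other_pos = np.flatnonzero(~ranking) + 1
        same_ranks.append(same_pos[k - 1])
        other_ranks.append(other_pos[k * (n_classes - 1) - 1])
    return float(np.mean(same_ranks)), float(np.mean(other_ranks))


rng = np.random.default_rng(0)
same_rank, other_rank = mean_ranks(n_per_class=50, n_classes=10, k=3,
                                   trials=2000, rng=rng)
```

With 50 objects per class and 10 classes, both averages should land near rank 30, which is what makes a triple with r close to k(M - 1) a roughly even contest for an uninformative distance.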

If our goal is not classification, but simply to provide
accurate indexing for nearest neighbor retrieval, the method
outlined above for choosing training triples would still be
useful. In each triple (q, a, b), a and b are both relatively
close to q, and therefore the training set of triples biases the
training algorithm to focus on preserving k-nearest-neighbor
structure, for small values of k, as opposed to preserving
similarity structure in general.

11.  EXPERIMENTS 

We compared BoostMap to FastMap [10] on the MNIST
dataset of handwritten digits, which is described in [16],
and is publicly available on the web. This dataset consists
of 60,000 training images, which we used as our database,
and 10,000 images, which we used as queries. Some of those
images can be seen in Figure 4. We used the symmetric
chamfer distance [5] as the underlying distance measure.
The chamfer distance is non-metric, because it violates the
triangle inequality.

Figure 4: Some examples from the MNIST database
of handwritten images.

For BoostMap, we always used 200,000 triples for training.
The objects appearing in those triples came from a set of
5000 database objects. The size of C, the set of candidate
objects defined in Section 7.1, was also 5000. We used
M_1 = 1000 and M_2 = 200. FastMap was run on a distance matrix
produced using 10,000 training objects.

For BoostMap we tested different variants, in order
to evaluate the advantages of the three extensions
introduced in this paper: choosing training triples selectively
vs. choosing them randomly, using a query-sensitive
distance measure vs. using a global distance measure in
Euclidean space, and optimizing BoostMap for classification
vs. optimizing BoostMap for preserving proximity
structure. To denote each of these variants, we use the following
abbreviations:

Fe Embedding optimized for approximating feature-space
distances D_X, as opposed to Pa.

Pa Embedding optimized for approximating parameter-space
distances Φ_X, i.e. for classification accuracy, as
opposed to Fe.

QI Query-insensitive distance measure D_out is used, as
defined in Section 7.4. The alternative to QI is QS.

QS Query-sensitive distance measure D_q is used, as
opposed to QI.

Ra Training triples are chosen entirely randomly from the
set of all possible triples, as opposed to Se.

Se Training triples are chosen selectively, from a restricted
set of possible triples, as described in Section 10. The
alternative to Se is Ra.

For example, BoostMap-Fe-Se-QS means that the
embedding was optimized for approximating feature-space
distances, we chose training triples selectively, and we used a
query-sensitive distance measure.

11.1  Measures  of  Embedding  Accuracy 

To evaluate the accuracy of the approximate similarity
ranking for a query, we use a measure that we call exact
k-nearest neighbor rank (ENN-k rank), defined as follows:
given query object q, and integer k, let b_1, ..., b_k be the
k nearest neighbors of q in the database, under the exact
distance D_X. Given an embedding F, the rank of any b_i
under the embedding is defined to be one plus the number
of database objects that F maps closer to F(q) than F(b_i)
is to F(q). Then, the ENN-k rank for q under embedding F
is the worst rank attained by any object in b_1, ..., b_k.

Figure 5: Plots of the 95th and 99th percentile of ENN-k ranks
attained by FastMap, BoostMap-Fe-Ra-QI, and BoostMap-Fe-Se-QI,
for different embedding dimensions, for k = 1, 10, 100.

For  example,  for  k  =  10,  an  ENN-10  rank  of  150  using 
BoostMap  and  16  dimensions  means  that  all  10  exact  k- 
nearest  neighbors  of  q  were  within  the  150  nearest  neighbors 
of  the  query,  as  computed  using  a  16-dimensional  BoostMap 
embedding.  Using  filter-and-refine  retrieval,  if  we  keep  150 
or  more  candidates  after  the  filter  step,  we  will  successfully 
identify  all  10  nearest  neighbors  of  q  at  the  refine  step. 
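A minimal sketch of the ENN-k rank computation for a single query follows. It assumes that the exact distances and the embedding-space distances from the query to every database object have already been computed and are passed in as arrays; the function name is illustrative.

```python
import numpy as np


def enn_k_rank(exact_dist, emb_dist, k):
    """ENN-k rank of one query.

    exact_dist: exact distances D_X from the query to each database object
    emb_dist:   distances from F(q) to each F(x) in the embedding space
    """
    exact_dist = np.asarray(exact_dist)
    emb_dist = np.asarray(emb_dist)
    # Indices of the k exact nearest neighbors b_1, ..., b_k.
    true_nn = np.argsort(exact_dist, kind="stable")[:k]
    # Rank of b: one plus the number of objects mapped strictly closer to F(q).
    ranks = [1 + int(np.sum(emb_dist < emb_dist[b])) for b in true_nn]
    return max(ranks)


example_rank = enn_k_rank([1, 2, 3, 4, 5], [0.5, 3.0, 1.0, 2.0, 4.0], k=2)
```

In the toy call the second exact neighbor is ranked fourth by the embedding, so keeping four candidates after the filter step would be needed to recover both true neighbors.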

If  we  have  a  set  of  queries,  then  we  can  look  at  different 
percentiles  of  the  ENN-k  ranks  attained  for  those  queries. 
For  example,  given  embedding  F,  a  value  of  3400  for  the 
95th  percentile  of  ENN-10  ranks  means  that,  for  95%  of  the 
10,000  query  objects,  the  ENN-10  rank  was  3400  or  less. 
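Under the reading used in this example (the smallest value v such that at least the given percentage of queries have ENN-k rank at most v), the percentile can be computed as sketched below. The helper name is hypothetical, and other percentile conventions that interpolate between values would give slightly different numbers.

```python
import math


def rank_percentile(ranks, pct):
    """Smallest value v such that at least pct% of the ranks are <= v."""
    ordered = sorted(ranks)
    needed = math.ceil(pct * len(ordered) / 100)  # queries that must be covered
    return ordered[max(needed, 1) - 1]


p95 = rank_percentile(range(1, 101), 95)
```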

Another measure of accuracy for an embedding F is
simply the k-nearest neighbor classification error using F. Given
query object q, we identify the k nearest neighbors of F(q)
in the embedding of the database. Each of those k objects
gives a vote for its class label. The class label that receives
the most votes is assigned to q. If two or more classes receive
the same number of votes, we find, for each class y among
those classes, the database object x_{y,q} that belongs to class
y and is the nearest to q, and we choose the class y for which
the corresponding x_{y,q} is the closest to q. If there is still a
tie, we break it by choosing at random.
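The voting and tie-breaking rule can be sketched as follows. The sketch assumes that "nearest to q" in the tie-breaking step is measured with the same embedding-space distances used to find the k neighbors; the function name and argument layout are illustrative.

```python
import numpy as np


def knn_classify(emb_dist, labels, k, rng):
    """Classify a query from its embedding-space distances to the database."""
    emb_dist = np.asarray(emb_dist)
    labels = np.asarray(labels)
    nearest = np.argsort(emb_dist, kind="stable")[:k]
    classes, votes = np.unique(labels[nearest], return_counts=True)
    tied = classes[votes == votes.max()]
    if len(tied) == 1:
        return int(tied[0])
    # Tie on votes: for each tied class y, find the distance of its
    # nearest member x_{y,q}, and keep the class whose member is closest.
    best = np.array([emb_dist[labels == y].min() for y in tied])
    closest = tied[best == best.min()]
    # Any remaining tie is broken at random.
    return int(rng.choice(closest))


rng = np.random.default_rng(0)
label_k4 = knn_classify([0.1, 0.2, 0.3, 0.4], [0, 1, 1, 0], k=4, rng=rng)
label_k3 = knn_classify([0.1, 0.2, 0.3, 0.4], [0, 1, 1, 0], k=3, rng=rng)
```

For k = 4 both classes get two votes, and class 0 wins because its nearest member is closer to the query; for k = 3 class 1 wins outright.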

11.2  BoostMap  vs.  FastMap 

Figure 5 shows the 95th and 99th percentiles of ENN-k
ranks attained by FastMap, BoostMap-Fe-Ra-QI, and
BoostMap-Fe-Se-QI, for different embedding dimensions, for
k = 1, 10, 100. We note that in fewer than 16 dimensions, FastMap
sometimes gives the best results. From 16 dimensions and
on, BoostMap-Fe-Ra-QI (which is essentially the BoostMap
variant described in [2]) gives better results than FastMap.
BoostMap-Fe-Se-QI does worse for lower dimensions, but at
256 dimensions it gives the best results in all cases. These
facts also hold for other values of k (up to 100) and other
percentiles (from 80% to 99%) that we have checked.

The results demonstrate that, in a filter-and-refine
framework, using BoostMap-Fe-Ra-QI we typically need to keep
significantly fewer candidate matches after the filter step,
and overall we can find the correct top k nearest neighbors
evaluating far fewer exact distances D_X, compared to using
FastMap. As k and the percentile increase, at some point it
becomes beneficial to use BoostMap-Fe-Se-QI.

For  example,  if  we  want  to  retrieve  the  true  10  nearest 
neighbors  ( k  =  10)  for  98%  of  the  query  objects,  these  are 
the  optimal  results  we  get  for  the  three  different  methods: 

FastMap: We get the best result for 11 dimensions, and
keeping 5523 database objects after the filter step. We
need to compute 22 distances D_X to embed each query,
and 5523 distances D_X to find the 10 nearest
neighbors. In total, we compute 5545 D_X distances.

BoostMap-Fe-Ra-QI: We get the best result for 256
dimensions, and keeping 1698 database objects after the
filter step. We need to compute at most 512 distances
D_X to embed each query (for some dimensions we need
one D_X evaluation, for some dimensions we need two
D_X evaluations), and 1698 distances D_X to find the
10 nearest neighbors. In total, we compute at most
2210 D_X distances.

BoostMap-Fe-Se-QI: We get the best result for 256
dimensions, and keeping 637 database objects after the
filter step. We need to compute at most 512 distances
D_X to embed each query, and 637 distances D_X to
find the 10 nearest neighbors. In total, we compute at
most 1149 D_X distances.

In domains where computing D_X distances is the
computational bottleneck, and computing distances in 256-dimensional
Euclidean space is relatively fast, results like the above mean
that BoostMap leads to significantly more efficient
filter-and-refine retrieval.

                     query-insensitive       query-sensitive
  k   percentile    random    selective    random    selective
  1       95            77         20          38         20
  1       99           349         73         136         62
 10       95           949        330         502        273
 10       99          2483       1010        1302        675
100       95          6220       4406        3773       3215
100       99         11617      10508        7437       7333

Table 1: Comparison of the two methods for choosing
training triples: sampling them from the set of all possible
triples vs. choosing them from a selective subset of
triples. For 256-dimensional embeddings, we show the
95th and 99th percentiles of ENN-k ranks, for k = 1,
10, 100.
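The filter-and-refine procedure whose cost is counted in the three cases above can be sketched as follows. The sketch assumes the database has already been embedded, that the filter step uses a weighted Manhattan distance with one weight per dimension, and that the exact distance is supplied as a Python callable. All names are illustrative, and the number of exact evaluations needed to embed the query is passed in rather than computed.

```python
import numpy as np


def filter_and_refine(query, query_emb, db, db_emb, weights, exact_dist, p, k):
    """Return the indices of the k best matches among the p filtered candidates."""
    # Filter step: weighted Manhattan (L1) distance in the embedding space.
    emb_dist = np.abs(db_emb - query_emb) @ weights
    candidates = np.argsort(emb_dist, kind="stable")[:p]
    # Refine step: exact distances are computed for the p candidates only.
    exact = np.array([exact_dist(query, db[i]) for i in candidates])
    return candidates[np.argsort(exact, kind="stable")[:k]]


def total_exact_evaluations(embedding_evals, kept):
    """Exact distances per query: embedding the query plus the refine step."""
    return embedding_evals + kept


fastmap_cost = total_exact_evaluations(22, 5523)
boostmap_ra_cost = total_exact_evaluations(512, 1698)
boostmap_se_cost = total_exact_evaluations(512, 637)
```

The three totals reproduce the 5545, 2210, and 1149 exact distance computations quoted for FastMap, BoostMap-Fe-Ra-QI, and BoostMap-Fe-Se-QI respectively.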

At this point, we have only trained 256-dimensional
query-sensitive embeddings, so we do not have enough data to
include query-sensitive embeddings in the plots of Figure
5. Later in this section we show that using query-sensitive
embeddings further improves embedding quality.

11.3  Random  vs.  Selective  Training  Triples 

Figure 5 shows the ENN-k ranks attained by
BoostMap-Fe-Se-QI vs. BoostMap-Fe-Ra-QI for different percentiles.
We note that, for lower dimensions, choosing training triples
from a restricted set seems to lead to less accurate
embeddings. On the other hand, at 256 dimensions, choosing
triples selectively leads to more accurate embeddings.

One possible interpretation of these results is that, by
choosing triples selectively, the training algorithm optimizes
the embedding so that it is highly accurate on those triples,
but not necessarily on other triples. If each training triple
(q, a, b) is such that a and b are close to q, the training will
not consider triples (q, a, b') where b' is farther away from
q (for example, cases where b' is not in the 1000 nearest
neighbors of q).

For example, suppose that we want to retrieve the 10
nearest neighbors a_1, ..., a_10 of q in the original space X with
distance measure D_X. An ideal embedding F_ideal would map
q closer to those 10 neighbors than to any other object. The
ENN-10 rank that is achieved by an embedding F for object
q increases because of objects b such that F(q) is closer to
F(b) than it is to F(a_i) for one of the ten nearest neighbors
a_i. Choosing training triples (q, a, b) so that b is, say, within
the 1000 nearest neighbors of q, we make the implicit
assumption that objects outside the 1000 nearest neighbors of q
will not cause problems, i.e. we expect that the embedding F
will not map q closer to any of those distant objects than to
q's 10 nearest neighbors. That's why we want the training
algorithm to focus more on triples where b is close to q.

Figure 6: Comparison of the two methods for choosing training
triples: sampling them from the set of all possible triples vs.
choosing them from a selective subset of triples. We plot ENN-k
ranks vs. percentile, for 256-dimensional embeddings, for k = 1,
10, 100. On the left we show query-insensitive embeddings, on the
right we show query-sensitive embeddings. In all cases, and for
all percentiles, choosing triples selectively leads to better
results.

Figure 7: Comparison of query-insensitive versus query-sensitive
embeddings. We plot ENN-k ranks vs. percentile, for
256-dimensional embeddings, for k = 1, 10, 100. On the left we
show embeddings learned using random training triples, on the
right we show embeddings learned using selective training
triples. In most cases query-sensitive embeddings give similar or
better results, compared to query-insensitive embeddings, when
all other settings are fixed.

Figure 8: K-nearest neighbor classification error using
query-insensitive and query-sensitive embeddings. For
k = 1, ..., 100 we show the corresponding k-nn error rate.

This assumption is obviously violated in lower-dimensional
embeddings, which are not very accurate and can map
distant objects close to each other. In those cases, choosing
random triples forces the training algorithm to try to
preserve the overall proximity structure of the space, whereas
choosing triples so that a and b are close to q means that
the training algorithm does not penalize choices that map
distant objects close to each other.

In high dimensions, it is much rarer for distant
objects to map close to each other, and then the main source
of indexing errors is objects that are somewhat close to q.
Training BoostMap with selective triples optimizes the
embedding so as to avoid that type of indexing error, so overall
we get higher embedding quality.

At this point, this interpretation is just a hypothesis. We
need additional experiments, in which we vary the
parameter k' (defined in Section 10) that specifies how close a and
b are to q for each training triple. If larger k' values lead
to higher accuracy in lower dimensions, that would provide
supporting evidence for our interpretation. In the
experiments reported here, we used k' = 4.

To demonstrate the advantages of choosing triples
selectively for high-dimensional embeddings, we compare the two
methods of choosing training triples in Figure 6 and Table 1.
The results demonstrate that, in 256 dimensions, choosing
triples selectively leads to better embedding quality.

11.4 Query-Sensitive vs. Query-Insensitive Embeddings

To evaluate the benefits of query-sensitive embeddings
(i.e. BoostMap embeddings that use query-sensitive
distance measures) we trained, for different settings,
query-sensitive 256-dimensional embeddings. Figure 7 and Table
2 compare query-sensitive embeddings to query-insensitive
embeddings, based on different percentiles of ENN-k ranks.
We see that, in most cases, the query-sensitive variants give
similar or better results than the query-insensitive variants,
and in some cases the results are significantly better.

                     BoostMap-Fe-Ra       BoostMap-Fe-Se
  k   percentile      QI        QS         QI        QS
  1       95           77        38         20        20
  1       99          349       136         73        62
 10       95          949       502        330       273
 10       99         2483      1302       1010       675
100       95         6220      3773       4406      3215
100       99        11617      7437      10508      7333

Table 2: Comparison of query-insensitive (QI) versus
query-sensitive (QS) embeddings. For 256-dimensional
embeddings, we show the 95th and 99th percentiles of
ENN-k ranks, for k = 1, 10, 100.

We also evaluate query-sensitive embeddings based on
the k-nn classification error rate attained using embeddings
optimized for classification (parameter-space embeddings).
Figure 8 shows the corresponding results for 256-dimensional
parameter-space embeddings. For all values of k that we
tested, the query-sensitive embedding had a lower error rate
than the query-insensitive embedding.

11.5 Parameter-Space vs. Feature-Space Embeddings

As discussed in Section 9, we expect parameter-space
embeddings to be worse than feature-space embeddings with
respect to preserving the feature-space distance D_X, but at
the same time we expect parameter-space embeddings to
give higher classification accuracy than feature-space
embeddings. Figure 9 shows percentiles of ENN-k ranks, and
Figure 10 shows the k-nn error rates achieved for different
values of k, for feature-space and parameter-space
embeddings. These results agree with our expectations.

Figure 11 and Table 3 compare the classification error
rates achieved using the original chamfer distance and using
a 256-dimensional parameter-space query-sensitive
embedding. It is interesting to note that for most values of k
the embedding actually achieves a lower error rate than the
original distance measure. The chamfer distance achieves
the best overall error rate, but it is only marginally better
than the best error rate achieved using the embedding: for
k = 5, the chamfer distance misclassified 463 images and the
embedding misclassified 468 images, out of 10,000 objects.

In domains where we get such results, we actually do not
need to apply filter-and-refine retrieval in order to do k-nn
classification, since we get equally good results using
nearest neighbors in the Euclidean space. When computing
distances in the original space is the computational bottleneck,
using a parameter-space embedding can speed up
recognition significantly, since we only need to compute a few
hundred distances in the original space, in order to compute
the embedding of the query object.
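To illustrate why embedding a query costs at most two exact distance evaluations per dimension, the sketch below assumes two families of 1D embeddings that are defined outside this section: a reference-object embedding F(x) = D_X(x, r), and a FastMap-style projection onto the "line" through two pivot objects x1 and x2. The coordinate encoding, the caching of repeated evaluations, and all names are illustrative assumptions rather than the original implementation.

```python
def embed_query(q, coords, objects, dist):
    """Embed q and count the exact distance evaluations that were needed.

    coords: list of ("ref", r) or ("line", x1, x2, d12) tuples, where r, x1
            and x2 are keys into `objects` and d12 = dist(x1, x2) is assumed
            to have been computed offline.
    """
    evals = {}  # exact distances from q, computed at most once per object

    def d_to(obj_id):
        if obj_id not in evals:
            evals[obj_id] = dist(q, objects[obj_id])
        return evals[obj_id]

    embedding = []
    for c in coords:
        if c[0] == "ref":
            # Reference-object coordinate: one exact evaluation.
            embedding.append(d_to(c[1]))
        else:
            # Pivot-pair projection: two exact evaluations.
            _, x1, x2, d12 = c
            d1, d2 = d_to(x1), d_to(x2)
            embedding.append((d1 ** 2 + d12 ** 2 - d2 ** 2) / (2 * d12))
    return embedding, len(evals)


objects = {0: 0.0, 1: 4.0, 2: 10.0}          # toy objects on the real line
coords = [("ref", 0), ("line", 0, 1, 4.0), ("line", 1, 2, 6.0)]
emb, n_evals = embed_query(1.0, coords, objects, lambda x, y: abs(x - y))
```

Once the query is embedded this way, k-nn classification can proceed entirely in the Euclidean space, so the exact evaluations counted here are the only ones needed for that query.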

FastMap, MetricMap, SparseMap, Bourgain embeddings,
and the original formulation of BoostMap usually provide
approximations of an original distance measure that lead to
more efficient distance computations but less accuracy.
Using parameter-space BoostMap embeddings it is possible in
some domains to obtain both faster distance measures and
higher classification accuracy. For the MNIST database and
using the chamfer distance, we get a Euclidean
approximation of the chamfer distance that achieves similar accuracy.
It will be interesting to see if we get similar results in other
domains.

Figure 9: Comparison of feature-space versus parameter-space
embeddings, with respect to ENN-k ranks. We plot ENN-k ranks vs.
percentile, for 256-dimensional embeddings, for k = 1, 10, 100.
On the left we show query-insensitive embeddings, on the right we
show query-sensitive embeddings. Feature-space embeddings give
somewhat better results for ENN-k ranks.

Figure 10: Comparison of feature-space versus parameter-space
embeddings, with respect to k-nn classification error rate. We
plot k-nn error rate versus k. On the left we show
query-insensitive embeddings, on the right we show
query-sensitive embeddings. Parameter-space embeddings give lower
error rates.

Figure 11: Comparison of k-nn error rates using the original
distance measure, and using a 256-dimensional
parameter-space query-sensitive embedding. We plot
k-nn error rate vs. k.

              chamfer distance    BoostMap-Pa-Se-QS
1-nn              0.0547              0.0535
Best k-nn         0.0463 (k = 5)      0.0468 (k = 5)

Table 3: Comparison of k-nn error rates using the original
distance measure, and using a 256-dimensional
parameter-space query-sensitive embedding. We show
the error rates for 1-nn, and for the value of k that
achieved the lowest k-nn error rate (the best k equals
5 in both cases).

12.  DISCUSSION 

The  experimental  results  reported  in  this  paper  provide  a 
thorough  evaluation  of  different  BoostMap  variants  on  the 
MNIST  dataset  using  the  chamfer  distance  as  the  underlying 
distance  measure.  However,  experiments  with  more  datasets 
and  comparisons  with  more  existing  methods  are  needed  in 
order  to  have  a  clear  picture  of  the  relative  advantages  and 
disadvantages  of  BoostMap. 

The  main  disadvantage  of  BoostMap  is  the  running  time 
of  the  training  algorithm.  At  the  same  time,  in  [2]  and  in 
this  paper  we  have  successfully  completed  the  training  for 
large  datasets  and  with  computationally  expensive  distance 
measures,  so  we  expect  BoostMap  to  be  applicable  in  a  wide 
range  of  domains.  Furthermore,  in  many  applications,  the 
training  time  can  be  an  acceptable  cost  as  long  as  it  leads  to 
a  higher-quality  embedding,  and  significantly  faster  nearest 
neighbor  retrieval  and  k-nn  classification. 

The main advantage of BoostMap is that it is
formulated as a classifier-combination problem, so that we can
take advantage of powerful machine learning techniques to
construct a highly accurate embedding from many simple,
1D embeddings. Our problem definition, which treats
embeddings as classifiers, leads to an embedding construction
method that can be applied in any space, metric or
non-metric, without assuming any property of the space, except
for expecting 1D embeddings to behave as weak classifiers.
In contrast, FastMap makes strong Euclidean assumptions
that, in our experiments, are useful only for low-dimensional
embeddings. Bourgain embeddings make weaker
assumptions, but they still assume that the underlying space is
metric.
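The sense in which a 1D embedding acts as a classifier of triples can be sketched as follows. The sketch assumes the classifier associated with a 1D embedding F outputs |F(q) - F(b)| - |F(q) - F(a)|, so that a positive value predicts "q is closer to a", and that triple labels are +1 or -1. The function names and the toy data are illustrative.

```python
def embedding_classifier(F, q, a, b):
    """Positive if the 1D embedding F maps q closer to a than to b."""
    return abs(F(q) - F(b)) - abs(F(q) - F(a))


def triple_accuracy(F, triples, labels):
    """Fraction of labeled triples on which F's classifier has the right sign."""
    correct = sum(1 for (q, a, b), y in zip(triples, labels)
                  if embedding_classifier(F, q, a, b) * y > 0)
    return correct / len(triples)


# Toy case: objects are numbers and F is the identity embedding.
triples = [(0, 1, 5), (0, 4, 2), (0, 3, 1)]
labels = [1, -1, 1]
acc = triple_accuracy(lambda x: x, triples, labels)
```

A 1D embedding only needs an accuracy above one half on the training triples to be usable as a weak classifier; no metric property of the underlying space is required for that.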

The machine-learning formulation also allows us to
define query-sensitive embeddings and parameter-space
embeddings, which are shown in the experiments to improve
overall embedding quality for both nearest neighbor retrieval
and nearest-neighbor classification accuracy. There are no
obvious modifications to FastMap, Bourgain embeddings,
and other related methods, that could yield query-sensitive
or parameter-space embeddings. Posing embedding
construction as minimization of a well-defined classification
error provides us with great flexibility in deciding exactly what
we want to optimize for, and how to achieve that
optimization.